GNU bug report logs - #17376
[PATCH] grep: fix the different behaviour for a invalid sequence between KWset and DFA
Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Date: Wed, 30 Apr 2014 15:03:01 UTC
Severity: normal
Tags: patch
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
On 04/30/2014 08:02 AM, Norihiro Tanaka wrote:
> There is different behaviour for an invalid sequence between KWset and DFA.
>
> encode() { echo "$1" | tr ABC '\357\274\241'; }
> encode ABC | env LC_ALL=en_US.utf8 src/grep "$(encode A)\|q"
> encode ABC | env LC_ALL=en_US.utf8 src/grep -F "$(encode A)"
> encode sABC | env LC_ALL=en_US.utf8 src/grep "a$(encode A)\|q"
> encode sABC | env LC_ALL=en_US.utf8 src/grep -F "a$(encode A)"
>
> We expect all of them to give the same result, but only the 4th returns 1 row.
Sorry, but I am not observing this behavior. With grep 2.18, none of
the commands output anything. The same is true for the git master.
If the pattern or data have encoding errors, POSIX says grep can do
whatever it likes. As I understand it, in grep 2.18 and the git master,
an encoding-error byte in a pattern matches only the same encoding-error
byte in the data. Does this bug report's patch change behavior, so that
an encoding-error byte in a pattern can match part of a valid
multibyte character in the data? If so, it's not clear to me why the
proposed behavior change is helpful -- as a user, I'm not sure I'd want
such a match to work. If not, then could you please explain a bit more
what's going on?
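To make the question concrete: the `encode` helper in the report maps A, B and C to the octal bytes 357, 274 and 241. Together those three bytes are the UTF-8 encoding of U+FF21, so `encode A` on its own yields only the lead byte of that character, which is an encoding error. A minimal sketch for inspecting the bytes follows; `hexbytes` is a hypothetical helper introduced here for illustration and is not part of the report.

```shell
# Helper copied from the report: maps A, B, C to the octal bytes 357, 274, 241.
encode() { echo "$1" | tr ABC '\357\274\241'; }

# Hypothetical helper (not in the report): print stdin as contiguous hex digits.
hexbytes() { od -An -t x1 | tr -d ' \n'; echo; }

encode ABC | hexbytes   # efbca10a: EF BC A1 is U+FF21 in UTF-8, then the newline
encode A | hexbytes     # ef0a: a lone lead byte followed by the newline
```

The pattern byte EF is therefore the first byte of the valid character in the data, which is the situation the question above asks about.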
More generally, I don't think users care about encoding-error bytes in
patterns. If it helps simplify the code and/or improves performance,
I'd favor changing 'grep' so that it simply rejects patterns containing
encoding errors, and exits with status 2. We should probably wait until
after the next release before doing anything that drastic, though.
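The stricter behavior floated above can be approximated outside grep with a small wrapper. This is only a sketch under stated assumptions, not how grep does or would implement it: `pattern_is_valid_utf8` and `strict_grep` are hypothetical names, the check assumes an `iconv` utility is installed, and it hard-codes UTF-8 rather than consulting the locale.

```shell
# Hypothetical check, not grep's implementation: succeed only if the
# argument is valid UTF-8. Assumes an iconv utility is available.
pattern_is_valid_utf8() {
  printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Hypothetical wrapper: refuse a malformed pattern with status 2,
# otherwise pass the pattern and any remaining arguments to grep.
strict_grep() {
  pattern=$1
  shift
  if ! pattern_is_valid_utf8 "$pattern"; then
    echo "strict_grep: pattern contains an encoding error" >&2
    return 2
  fi
  grep -e "$pattern" "$@"
}
```

For example, `printf 'abc\n' | strict_grep "$(printf '\357')"` would return status 2 without running grep at all, while a well-formed pattern is passed through unchanged.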
This bug report was last modified 11 years and 15 days ago.