GNU bug report logs - #17376
[PATCH] grep: fix the different behaviour for a invalid sequence between KWset and DFA

Package: grep;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Wed, 30 Apr 2014 15:03:01 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, 17376 <at> debbugs.gnu.org
Subject: bug#17376: [PATCH] grep: fix the different behaviour for a invalid sequence between KWset and DFA
Date: Wed, 30 Apr 2014 12:04:50 -0700

On 04/30/2014 08:02 AM, Norihiro Tanaka wrote:
> There is different behaviour for an invalid sequence between KWset and DFA.
>
>    encode() { echo "$1" | tr ABC '\357\274\241'; }
>    encode ABC | env LC_ALL=en_US.utf8 src/grep "$(encode A)\|q"
>    encode ABC | env LC_ALL=en_US.utf8 src/grep -F "$(encode A)"
>    encode sABC | env LC_ALL=en_US.utf8 src/grep "a$(encode A)\|q"
>    encode sABC | env LC_ALL=en_US.utf8 src/grep -F "a$(encode A)"
>
> We expect all of them to give the same result, but only the 4th returns 1 row.

Sorry, but I am not observing this behavior.  With grep 2.18, none of 
the commands output anything.  The same is true for the git master.

If the pattern or data have encoding errors, POSIX says grep can do 
whatever it likes.  As I understand it, in grep 2.18 and the git master, 
an encoding-error byte in a pattern matches only the same encoding-error 
byte in the data.  Does this bug report's patch change behavior, so that 
an encoding-error byte in a pattern can match part of a valid 
multibyte character in the data?  If so, it's not clear to me why the 
proposed behavior change is helpful -- as a user, I'm not sure I'd want 
such a match to work.  If not, then could you please explain a bit more 
what's going on?
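
To make the question concrete, here is a minimal sketch of what the bytes in the quoted test case look like. It reuses the reporter's encode function (with LC_ALL=C added to tr as an assumption, so that tr treats its operands as raw bytes); hexbytes and is_valid_utf8 are hypothetical helper names, and the check assumes an iconv that exits nonzero on malformed UTF-8 input, as glibc's does.

```shell
# Reporter's helper: map A, B, C to the bytes 0xEF, 0xBC, 0xA1.
# Together those three bytes are the valid UTF-8 encoding of U+FF21.
encode() { printf '%s\n' "$1" | LC_ALL=C tr ABC '\357\274\241'; }

# Hypothetical helper: print stdin as a contiguous run of hex bytes.
hexbytes() { od -An -tx1 | tr -d ' \n'; }

# Hypothetical helper: succeed only if stdin is well-formed UTF-8.
# Assumes iconv exits nonzero on an invalid or incomplete sequence.
is_valid_utf8() { iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; }

encode ABC | hexbytes; echo    # efbca10a  (one complete character + newline)
encode A | hexbytes; echo      # ef0a      (a lone lead byte + newline)

if encode A | is_valid_utf8; then
  echo "pattern is valid UTF-8"
else
  echo "pattern contains an encoding error"
fi
```

The pattern produced by encode A is therefore a lone lead byte, while the data produced by encode ABC contains that same byte only as the first byte of a complete character.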

More generally, I don't think users care about encoding-error bytes in 
patterns.  If it helps simplify the code and/or improves performance, 
I'd favor changing 'grep' so that it simply rejects patterns containing 
encoding errors, and exits with status 2.  We should probably wait until 
after the next release before doing anything that drastic, though.
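
As a rough illustration of that idea only, the following sketch wraps grep in a shell function that refuses a pattern with encoding errors and returns status 2. The name strict_grep is hypothetical and nothing here is part of grep itself; it assumes the pattern is meant to be UTF-8 and that iconv exits nonzero on malformed input.

```shell
# Hypothetical wrapper, not part of grep: reject patterns that are not
# well-formed UTF-8 before handing them to grep.
strict_grep() {
  pattern=$1
  shift
  # Assumption: iconv exits nonzero when the input has an encoding error.
  if ! printf '%s' "$pattern" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
    echo "strict_grep: pattern contains an encoding error" >&2
    return 2
  fi
  # Otherwise behave like grep; its usual 0/1/2 exit status passes through.
  grep -e "$pattern" "$@"
}
```

With this wrapper, a call such as strict_grep "$(printf '\357')" file would fail with status 2 without running grep at all, while well-formed patterns behave as before.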




This bug report was last modified 11 years and 15 days ago.
