#38656 - [PATCH] grep: do not match invalid UTF-8

GNU bug report logs - #38656
[PATCH] grep: do not match invalid UTF-8

Package: grep;

Reported by: Paul Eggert <eggert <at> cs.ucla.edu>

Date: Wed, 18 Dec 2019 06:06:02 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Message #10 received at 38656 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Bruno Haible <bruno <at> clisp.org>
Cc: bug-gnulib <at> gnu.org, 38656 <at> debbugs.gnu.org
Subject: Re: [PATCH 4/4] dfa: do not match invalid UTF-8
Date: Wed, 18 Dec 2019 09:06:30 -0800

On 12/18/19 12:48 AM, Bruno Haible wrote re my recent Gnulib change
<https://git.savannah.gnu.org/cgit/gnulib.git/commit/?id=1219c343014ede881069bab554408b40e5455d9c>,
with corresponding Grep change
<https://git.savannah.gnu.org/cgit/grep.git/commit/?id=c9a6e4bf919e1b28970e11b29aa720a7e6144834>:

> Do I understand it correctly that, as a consequence of this change,
> 'grep' with a regex of '^.*$' will no longer match lines which contains
> an invalid UTF-8 byte sequence?

Yes and no. dfa.c's '^.*$' already rejected some lines with invalid UTF-8 byte sequences. The change merely made dfa.c reject all such lines.

> - Is this effect on 'grep' intended? (And the workaround is to use the
> "C" locale.)

Yes.

> - Is it consistent with the behaviour of regex and kwset, which 'grep'
> also uses, depending on the arguments (as far as I understand)?

No, in the sense that the matchers disagree about what to do with encoding errors. I think regex '.' matches the first byte of an encoding error (which would be hard to mimic in that part of dfa.c as this behavior requires lookahead). I don't know what kwset does. In some sense it doesn't matter, as neither POSIX nor the grep manual say what to do when the pattern or input contains encoding errors.

I installed the patch because it seemed "wrong" to me that the "." pattern matched an invalid byte sequence of length 2 or more, with no characters in sight.

Conversely, I suppose if the change significantly hurts performance, then it should be reverted (but with a comment explaining why dfa.c accepts more than just the valid UTF-8 byte sequences) or perhaps redone in a better way.

I am cc'ing this to 38656 <at> debbugs.gnu.org to give 'grep' lurkers a heads-up about this.
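
For readers following along, here is a rough Python model of the behaviour described in this message: after the change, '^.*$' in a UTF-8 locale matches a line only if the line contains no encoding errors. This is a sketch of the observable rule only, not how dfa.c implements it. The function names are made up for illustration, and Python's strict UTF-8 decoder is assumed to be a reasonable stand-in for "valid UTF-8"; other matcher details are ignored.

```python
def is_valid_utf8(line: bytes) -> bool:
    """Return True if the bytes decode as well-formed UTF-8."""
    try:
        line.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def dot_star_matches_whole_line(line: bytes) -> bool:
    # Hypothetical model of '^.*$' after the change: '.' consumes only
    # validly encoded characters, so the whole line matches only when it
    # contains no encoding errors. The line is assumed to have no newline.
    return is_valid_utf8(line)


if __name__ == "__main__":
    samples = [
        b"caf\xc3\xa9",  # valid two-byte sequence for U+00E9
        b"caf\xe9",      # Latin-1 byte: a lead byte with no continuation
        b"\xe2\x82",     # truncated three-byte sequence
        b"",             # empty line: '.*' matches zero characters
    ]
    for sample in samples:
        print(sample, dot_star_matches_whole_line(sample))
```

Under this model the second and third samples are rejected. In the "C" locale, the workaround mentioned above, every byte counts as a character, so no such rejection applies.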

This bug report was last modified 5 years and 217 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
