GNU bug report logs - #18266
grep -P and invalid UTF-8 exits with error


Package: grep

Reported by: Santiago <santiago <at> debian.org>

Date: Thu, 14 Aug 2014 15:43:02 UTC

Severity: wishlist

Merged with 18455

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived.



From: Vincent Lefevre <vincent <at> vinc17.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 18266 <at> debbugs.gnu.org, Santiago <santiago <at> debian.org>, 758105 <at> bugs.debian.org
Subject: bug#18266: handling bytes not part of the charset, and other garbage (was: grep -P and invalid UTF-8 exits with error)
Date: Thu, 11 Sep 2014 13:07:00 +0200
On 2014-09-01 01:31:53 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
> >If there are many invalid UTF8 bytes, this would be slow, IMHO
> 
> That's OK.  We don't need grep -P to be fast on invalid input.

I can see this causing too significant a slowdown in practical cases.

> >But is the copy of the buffer really needed? Couldn't the invalid
> >UTF8 sequences just be replaced by null bytes?
> 
> I'd rather not, because that changes the semantics of matching.  The null
> byte is valid input data that might get matched.
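
Indeed, a null byte is ordinary, matchable input in the C locale, so
overwriting invalid sequences with null bytes could create spurious
matches. A quick illustration of that point (a sketch assuming GNU
grep; the file name is illustrative):

  $ printf 'a\0b\n' > nul.txt
  $ LC_ALL=C grep -ac 'a.b' nul.txt    # '.' matches the raw NUL byte
  1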

It appears that the current behavior in UTF-8 locales is incorrect,
even without -P. For instance:

$ printf 'tr\xe8s\n' > text
$ grep 'tr.s' text
$ LC_ALL=C grep 'tr.s' text
tr<E8>s

There's no reason that '.' should match a byte that doesn't belong to
the charset in the C locale, yet fail to match it in a UTF-8 locale.

The pattern tr.s is used here to match the French word "très" in files
that could be encoded in either ISO-8859-1 or UTF-8. In the past,
before using UTF-8 locales, I was doing something like:

  grep -E 'tr..?s' text

to match both encodings, and this worked (it could yield false
positives, but in practice one is often not interested in every real
grep match anyway, so even when the encoding was known, false
positives were already being tolerated). It's annoying that in a
UTF-8 locale one can no longer match ISO-8859-1 text this way, and
doing a pre-conversion would take too much time.
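
In the C locale this still works for text in either encoding; a small
demonstration (the file names are illustrative, and -l is used so the
raw bytes need not be displayed):

  $ printf 'tr\xe8s\n'     > latin1.txt    # ISO-8859-1 "très"
  $ printf 'tr\xc3\xa8s\n' > utf8.txt      # UTF-8 "très"
  $ LC_ALL=C grep -lE 'tr..?s' latin1.txt utf8.txt
  latin1.txt
  utf8.txt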

Concerning binary files, I've never wanted to explicitly distinguish
null bytes from invalid UTF-8 sequences: IMHO, both are just garbage.
This obviously makes no difference with patterns like 'some_word' or
'foo[0-9]*bar', but when I use a pattern like 'foo.bar' or 'foo.*bar',
I can see two valid reasons for '.' to handle these sequences
uniformly:

1. One may want to match "valid" (often in the sense of "printable"
in the specified encoding) but unknown characters.

2. One may also want to match garbage (including null bytes, as well
as bytes that have no meaning in the charset), with the drawback that
this won't work if the garbage contains a newline character; see the
demonstration below.
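
To make reason 2 concrete, here is a sketch of the asymmetry
(assuming the shell runs under a UTF-8 locale; the file name is
illustrative):

  $ printf 'foo\x80bar\n' > garbage.txt     # 0x80 is never valid in UTF-8
  $ grep -c 'foo.bar' garbage.txt           # UTF-8 locale: '.' rejects the invalid byte
  0
  $ LC_ALL=C grep -c 'foo.bar' garbage.txt  # C locale: '.' matches the raw byte
  1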

-- 
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)



