GNU bug report logs - #18266
grep -P and invalid UTF-8 exits with error


Package: grep

Reported by: Santiago <santiago <at> debian.org>

Date: Thu, 14 Aug 2014 15:43:02 UTC

Severity: wishlist

Merged with 18455

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived.



From: Vincent Lefevre <vincent <at> vinc17.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 18266 <at> debbugs.gnu.org, Santiago <santiago <at> debian.org>, 758105 <at> bugs.debian.org
Subject: bug#18266: handling bytes not part of the charset, and other garbage (was: grep -P and invalid UTF-8 exits with error)
Date: Thu, 11 Sep 2014 13:07:00 +0200
On 2014-09-01 01:31:53 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
> >If there are many invalid UTF8 bytes, this would be slow, IMHO
> 
> That's OK.  We don't need grep -P to be fast on invalid input.

I can see this causing too significant a slowdown in practical cases.

> >But is the copy of the buffer really needed? Couldn't the invalid
> >UTF8 sequences just be replaced by null bytes?
> 
> I'd rather not, because that changes the semantics of matching.  The null
> byte is valid input data that might get matched.
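
Indeed, a null byte is ordinary, matchable input in the C locale, so
overwriting invalid sequences with null bytes could create spurious
matches. A quick illustration of that point (a sketch assuming GNU
grep; the file name is illustrative):

  $ printf 'a\0b\n' > nul.txt
  $ LC_ALL=C grep -ac 'a.b' nul.txt    # '.' matches the raw NUL byte
  1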

It appears that the current behavior in UTF-8 locales is incorrect,
even without -P. For instance:

$ printf 'tr\xe8s\n' > text
$ grep 'tr.s' text
$ LC_ALL=C grep 'tr.s' text
tr<E8>s

There's no reason that '.' should match a byte that doesn't belong to
the charset in the C locale, yet fail to match it in a UTF-8 locale.

The pattern tr.s is used here to match the French word "très" in files
that could be encoded in either ISO-8859-1 or UTF-8. In the past,
before using UTF-8 locales, I was doing something like:

  grep -E 'tr..?s' text

to match both encodings, and this worked (it could yield false
positives, but in practice one is often not interested in every real
grep match anyway, so even when the encoding was known, false
positives were already being tolerated). It's annoying that in a
UTF-8 locale one can no longer match ISO-8859-1 text this way, and
doing a pre-conversion would take too much time.
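
In the C locale this still works for text in either encoding; a small
demonstration (the file names are illustrative, and -l is used so the
raw bytes need not be displayed):

  $ printf 'tr\xe8s\n'     > latin1.txt    # ISO-8859-1 "très"
  $ printf 'tr\xc3\xa8s\n' > utf8.txt      # UTF-8 "très"
  $ LC_ALL=C grep -lE 'tr..?s' latin1.txt utf8.txt
  latin1.txt
  utf8.txt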

Concerning binary files, I've never wanted to explicitly distinguish
null bytes from invalid UTF-8 sequences: IMHO, both are just garbage.
This obviously makes no difference with patterns like 'some_word' or
'foo[0-9]*bar', but when I use a pattern like 'foo.bar' or 'foo.*bar',
I can see two valid reasons for '.' to handle these sequences
uniformly:

1. One may want to match "valid" (often in the sense of "printable"
in the specified encoding) but unknown characters.

2. One may also want to match garbage (including null bytes, as well
as bytes that have no meaning in the charset), with the drawback that
this won't work if the garbage contains a newline character; see the
demonstration below.
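
To make reason 2 concrete, here is a sketch of the asymmetry
(assuming the shell runs under a UTF-8 locale; the file name is
illustrative):

  $ printf 'foo\x80bar\n' > garbage.txt     # 0x80 is never valid in UTF-8
  $ grep -c 'foo.bar' garbage.txt           # UTF-8 locale: '.' rejects the invalid byte
  0
  $ LC_ALL=C grep -c 'foo.bar' garbage.txt  # C locale: '.' matches the raw byte
  1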

-- 
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)



