GNU bug report logs - #18266
grep -P and invalid exits with error

Package: grep

Reported by: Santiago <santiago <at> debian.org>

Date: Thu, 14 Aug 2014 15:43:02 UTC

Severity: wishlist

Merged with 18455

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.


From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, Santiago <santiago <at> debian.org>
Cc: 758105 <at> bugs.debian.org, Vincent Lefevre <vincent <at> vinc17.net>, 18266 <at> debbugs.gnu.org
Subject: bug#18266: grep -P and invalid exits with error
Date: Tue, 09 Sep 2014 12:59:27 -0700
Norihiro Tanaka wrote:
> I'm worried that to re-run for invalid UTF-8 makes slowness for searching
> of the large number of binary files.

Yes, that could be a problem, but even so it's better for grep to report 
matches than to give up and fail.  Perhaps someone could optimize this 
better later, but to be honest given how flaky libpcre is we're probably 
better off spending our scarce development resources elsewhere.

Santiago's latest patch still had some troubles, unfortunately.  It 
could mishandle '^' by having it match just past an encoding error.  It 
was less efficient than it could be, as it checked all valid bytes for 
UTF-8-edness twice.  If I understand PCRE correctly (which quite 
possibly I don't), it also appeared to mishandle matches that contain 
nested subexpressions.  But the worst part was that the code was too 
complicated (and this was true even before Santiago's patch was 
applied).  So I rewrote it and installed the attached patch instead. 
Please give it a try.
[0001-grep-P-now-treats-invalid-UTF-8-input-as-non-matchin.patch (text/plain, attachment)]
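
Editorial illustration (not part of the original message and not the attached patch): the '^' problem described above can be reproduced in a small sketch. It assumes Python's re module as a stand-in for libpcre, and the helper names valid_runs, line_matches_naive and line_matches are hypothetical. Invalid bytes are decoded with the "surrogateescape" error handler, so each one becomes a code point in U+DC80..U+DCFF and the validly encoded runs between them can be located. Matching each run as an independent string lets '^' match just past an encoding error. Searching the whole line with pos/endpos bounds avoids that, because '^' then matches only at the real start of the string.

```python
import re


def valid_runs(text):
    """Yield (start, end) index pairs for maximal runs of validly decoded text.

    `text` is assumed to come from bytes.decode("utf-8", "surrogateescape"),
    so every invalid input byte shows up as a code point in U+DC80..U+DCFF.
    """
    start = None
    for i, ch in enumerate(text):
        if "\udc80" <= ch <= "\udcff":      # escaped invalid byte
            if start is not None:
                yield start, i
                start = None
        elif start is None:
            start = i
    if start is not None:
        yield start, len(text)


def line_matches_naive(pattern, raw_line):
    """Match each valid run as its own string (shows the '^' bug)."""
    text = raw_line.decode("utf-8", errors="surrogateescape")
    # Slicing makes every run look like the start of a line to '^'.
    return any(pattern.search(text[s:e]) for s, e in valid_runs(text))


def line_matches(pattern, raw_line):
    """Match within each valid run while keeping the line's real start."""
    text = raw_line.decode("utf-8", errors="surrogateescape")
    # With pos/endpos the search is confined to the run, but '^' still
    # anchors only at the true beginning of `text`.
    return any(pattern.search(text, s, e) for s, e in valid_runs(text))


if __name__ == "__main__":
    line = b"foo\xffbar"                    # 0xFF is never valid in UTF-8
    anchored = re.compile(r"^bar")
    print(line_matches_naive(anchored, line))
    print(line_matches(anchored, line))
    print(line_matches(re.compile(r"bar"), line))
```

For the line b"foo\xffbar", the naive version reports a match for ^bar while the bounded version does not, and an unanchored bar still matches in both. The sketch leaves '$' alone: endpos is treated as the end of the string, so '$' can still match just before an encoding error, and a fuller treatment would need a guard for that as well.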

