GNU bug report logs - #18266
grep -P and invalid exits with error

Package: grep;

Reported by: Santiago <santiago <at> debian.org>

Date: Thu, 14 Aug 2014 15:43:02 UTC

Severity: wishlist

Merged with 18455

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18266 <at> debbugs.gnu.org, Santiago <santiago <at> debian.org>, 758105 <at> bugs.debian.org
Subject: bug#18266: grep -P and invalid exits with error
Date: Mon, 01 Sep 2014 01:31:53 -0700

Vincent Lefevre wrote:

>        [...] Note that this option can also be passed to pcre_exec()
>        and pcre_dfa_exec(), to suppress the validity checking of
>        subject strings only. If the same string is being matched
>        many times, the option can be safely set for the second and
>        subsequent matchings to improve performance.
>
> The last sentence would imply that the UTF8 checking is done on the
> whole input buffer before matching is done.

That's pretty subtle, and perhaps too subtle.  A plausible 
interpretation of the phrase "same string is being matched" is that 
libpcre checks only the matched string, and that bytes after the match 
(which did not need to be examined to do the match) are not checked. 
Can you confirm with the libpcre authors that this plausible 
interpretation is incorrect, i.e., that the entire input string is 
checked, even the unmatched part?  If that's what is intended, the 
documentation should state so clearly, so at least there's a 
documentation bug there.
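
To make the two readings concrete, here is a minimal sketch in plain C. It is not libpcre's code: utf8_valid_prefix is a hypothetical helper written only for illustration. It returns how many leading bytes of a buffer are valid UTF-8, so it can be applied either to the matched part alone or to the entire subject.

```c
#include <stddef.h>

/* Hypothetical helper, not part of libpcre or grep.
   Return the length of the longest prefix of buf[0..len) that is
   valid UTF-8.  The result equals len exactly when the whole buffer
   is valid.  */
static size_t
utf8_valid_prefix (const unsigned char *buf, size_t len)
{
  size_t i = 0;

  while (i < len)
    {
      unsigned char c = buf[i];
      size_t n;            /* continuation bytes expected after c */
      size_t k;
      unsigned long min;   /* smallest code point allowed at this length */
      unsigned long cp;    /* decoded code point */

      if (c < 0x80)
        {
          i++;             /* ASCII byte */
          continue;
        }
      else if (0xC2 <= c && c <= 0xDF)
        n = 1, min = 0x80, cp = c & 0x1F;
      else if ((c & 0xF0) == 0xE0)
        n = 2, min = 0x800, cp = c & 0x0F;
      else if (0xF0 <= c && c <= 0xF4)
        n = 3, min = 0x10000, cp = c & 0x07;
      else
        break;             /* stray continuation byte or never-valid lead */

      if (len - i <= n)
        break;             /* sequence truncated by end of buffer */

      for (k = 1; k <= n; k++)
        {
          if ((buf[i + k] & 0xC0) != 0x80)
            break;         /* not a continuation byte */
          cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
      if (k <= n)
        break;

      /* Reject overlong forms, surrogates, and values past U+10FFFF.  */
      if (cp < min || cp > 0x10FFFF || (0xD800 <= cp && cp <= 0xDFFF))
        break;

      i += n + 1;
    }

  return i;
}
```

Take the 8-byte subject "abc\xff" "tail" and a pattern that matches "abc". Checking only the 3 matched bytes reports no problem, while checking all 8 bytes stops at offset 3. Under the first reading the match would succeed; under the second the whole string would be rejected as invalid.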

> If there are many invalid UTF8 bytes, this would be slow, IMHO

That's OK.  We don't need grep -P to be fast on invalid input.

> But is the copy of the buffer really needed? Couldn't the invalid
> UTF8 sequences just be replaced by null bytes?

I'd rather not, because that changes the semantics of matching.  The 
null byte is valid input data that might get matched.
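
A small illustration of the concern, as a sketch only. Both helpers are hypothetical and are not grep's code. For brevity, blank_never_valid_bytes overwrites only the bytes that can never occur in UTF-8 (0xC0, 0xC1, 0xF5 through 0xFF) rather than doing full validation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: overwrite with NUL every byte that can never
   appear in UTF-8 (0xC0, 0xC1, 0xF5..0xFF).  This is a simplification
   of real UTF-8 validation.  Return the number of bytes replaced.  */
static size_t
blank_never_valid_bytes (unsigned char *buf, size_t len)
{
  size_t replaced = 0;
  size_t i;

  for (i = 0; i < len; i++)
    if (buf[i] == 0xC0 || buf[i] == 0xC1 || buf[i] >= 0xF5)
      {
        buf[i] = '\0';
        replaced++;
      }

  return replaced;
}

/* Hypothetical stand-in for a pattern that matches a NUL byte:
   return 1 if buf[0..len) contains one, 0 otherwise.  */
static int
has_nul_byte (const unsigned char *buf, size_t len)
{
  return memchr (buf, '\0', len) != NULL;
}
```

For the three bytes 'a', 0xFF, 'b', has_nul_byte returns 0 on the original data. After blank_never_valid_bytes runs it returns 1, so a pattern looking for a null byte would report a match on input that never contained one.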

> in case of invalid UTF8 bytes, in some (many?) cases, the
> cause is a binary file (possibly with some text in it), where lines
> can be very long. So, wouldn't it mean that it can take significantly
> more memory?

Sure.  But that's the same for -P as it is for plain grep.




This bug report was last modified 10 years and 248 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.