GNU bug report logs - #20526
BUG: text file is detected as binary

Package: grep;

Reported by: Sebastian Poehn <sebastian.poehn <at> gmail.com>

Date: Thu, 7 May 2015 15:41:03 UTC

Severity: normal

Merged with 19230, 19985, 21558

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Eric Blake <eblake <at> redhat.com>, Kamil Dudka <kdudka <at> redhat.com>
Cc: 20526 <at> debbugs.gnu.org, sebastian.poehn <at> gmail.com
Subject: bug#20526: BUG: text file is detected as binary
Date: Tue, 12 May 2015 17:08:42 -0700
Eric Blake wrote:
> I'm still a bit worried that encoding errors encountered on input, even
> though they don't match for output, may still cause issues for some
> patterns (we've had cases of encoding errors causing 'grep -P' to go
> into an infinite loop, for example);

Yes, that's right.  We can't go back to the old way of doing things.  Encoding
errors in the data must not be matched by any regular expression (not even ".").
'grep -P' won't loop if we never pass encoding errors to the PCRE matcher, so
that's what we gotta do.

> but yes, as the behavior is
> undefined, we are still justified in adopting those heuristics, if
> someone is willing to contribute a patch along those lines.

The hard part about it (and the reason I haven't written up a patch yet) is 
making sure the above property holds, while continuing to have good performance 
in the typical case where the input is validly encoded.  I suppose it's OK, 
though, if the change hurts performance only for the -P case, since -P is so 
slow anyway.
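
To make the property above concrete, here is a minimal Python sketch, not grep's actual implementation. It assumes UTF-8 input and uses Python's `re` module as a stand-in for the PCRE matcher. The helper names `valid_utf8_runs` and `line_matches` are hypothetical. The idea is to hand the matcher only validly encoded runs, so an encoding error can never be matched, not even by ".". When the whole buffer is valid, a single decode succeeds and no splitting happens, which is the typical case mentioned above.

```python
import re
from typing import Iterator, Tuple


def valid_utf8_runs(data: bytes) -> Iterator[Tuple[int, str]]:
    """Yield (byte_offset, text) for each maximal validly encoded UTF-8 run.

    Hypothetical helper: bytes that form an encoding error are skipped and
    never appear in any yielded run.
    """
    pos = 0
    while pos < len(data):
        try:
            # Fast path: if the rest of the buffer is valid, this decodes it
            # in one call and we are done.
            yield pos, data[pos:].decode("utf-8")
            return
        except UnicodeDecodeError as err:
            # err.start and err.end index into the slice data[pos:].
            if err.start > 0:
                yield pos, data[pos:pos + err.start].decode("utf-8")
            # Skip the invalid bytes; err.end > err.start, so we always advance.
            pos += err.end


def line_matches(pattern: str, line: bytes) -> bool:
    """Return True if pattern matches inside some validly encoded run of line."""
    regex = re.compile(pattern)
    return any(regex.search(text) for _, text in valid_utf8_runs(line))
```

For example, `line_matches(r"b.c", b"ab\xffcd")` is false because the invalid byte `\xff` is never given to the matcher, while `line_matches(r"cd", b"ab\xffcd")` is true. This sketch has known gaps: anchors such as `^` and `$` would match at run boundaries, and re-slicing the buffer after every error is slow on input with many encoding errors, so it illustrates the invariant rather than a tuned design.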




This bug report was last modified 9 years and 138 days ago.
