GNU bug report logs - #22838
New 'Binary file' detection considered harmful
[Message part 1 (text/plain, inline)]
Your message dated Thu, 8 Sep 2016 18:43:43 -0700
with message-id <3fa28b6a-9a78-375a-5978-46987a9bb681 <at> cs.ucla.edu>
and subject line Re: bug#22838: New 'Binary file' detection considered harmful
has caused the debbugs.gnu.org bug report #22838,
regarding New 'Binary file' detection considered harmful
to be marked as done.
(If you believe you have received this mail in error, please contact
help-debbugs <at> gnu.org.)
--
22838: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=22838
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
The new heuristic for detecting 'Binary files' should be reverted to the
old one (before 2.20), as the new one has too great a potential to make
important tasks fail silently.
One of the most important use cases of grep is processing file lists,
e.g. in the pipeline: find | grep | tar. This is often done by backup
software, e.g. in the Debian package 'backup2l'.
The new behaviour of grep -- to output 'Binary file matches' after
output started -- has silently broken the 'backup2l' script and has the
potential of silently breaking many other backup scripts as well.
Test case:
$ find /etc/ssl/certs/ | LANG= grep pem
Outcome:
grep will stop with 'Binary file (standard input) matches' after
outputting a small percentage of the existing .pem files.
Expected behaviour:
grep should list all .pem files.
This behaviour is particularly insidious because users may not notice
that their backup archives are a bit smaller than before or that their
backups complete a bit faster, while many thousands of files may be
missing.
Q: Why do you use LANG= ?
A: To illustrate the problem and because 'backup2l' does that.
Q: Why don't people use the -a switch?
A: People may not notice anything wrong with their backups until they
need them.
Q: Why don't you file a bug against 'backup2l'?
A: I will. But this is such a common use case that I suspect that many
of the backup scripts that people wrote just for themselves are now broken.
Q: Why don't you just set the correct locale?
A: Even then, a single filename with a bogus encoding anywhere in the
list suffices to break your whole backup. Such files are easy to pick up
from the internet or from song or picture metadata.
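
A minimal sketch of the failure mode and of the -a workaround discussed
above, assuming a POSIX shell and GNU grep. The file names and the
temporary list are made up for illustration, and LC_ALL=C stands in for
the LANG= setting of the test case. What the first grep prints depends
on the grep version and locale, so no output is claimed for it.

```shell
#!/bin/sh
# Hypothetical file list: three names, one of which contains the byte
# 0xFF (octal 377), which is not valid UTF-8.
list=$(mktemp)
printf 'a.pem\nb\377d.pem\nc.pem\n' > "$list"

# Without -a: depending on the grep version and locale, the output may
# be cut short and replaced by a 'Binary file ... matches' notice.
LC_ALL=C grep pem "$list" | wc -l

# With -a (--text): every input line is treated as text, so all three
# names are counted regardless of their encoding.
matched=$(LC_ALL=C grep -a -c pem "$list")
total=$(wc -l < "$list" | tr -d ' ')
echo "matched $matched of $total names"

rm -f "$list"
```

Comparing the number of matched names against the length of the list is
one way a backup script could notice a truncated file list instead of
silently archiving fewer files.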
Regards
--
Marcello Perathoner
[Message part 3 (message/rfc822, inline)]
[Message part 4 (text/plain, inline)]
Paul Eggert wrote:
> On 03/01/2016 02:05 AM, Marcello Perathoner wrote:
>> 2) If you just output
>>
>> binary line 42 in file x matches
>>
>> and continue regular output after the next newline, the breakage would be much
>> more confined.
>
> This sounds like a good suggestion. That is, grep could keep going if its only
> problem is an attempt to output encoding errors (as opposed to reading null
> bytes, which are a more-reliable indication of binary data). It would probably
> be better to output just one "Binary file matches" line per file, at the end of
> the other matches, so that it's more likely to be noticed.
I finally got around to implementing this, which turned out to be considerably
easier than I thought it would be. I installed the attached patch into the grep
Savannah master. I am boldly closing this old bug report; we can always start a
new report if further problems turn up.
[0001-grep-encoding-errors-suppress-just-their-line.patch (text/x-diff, attachment)]
This bug report was last modified 8 years and 257 days ago.