GNU bug report logs - #22838
New 'Binary file' detection considered harmful


Package: grep

Reported by: Marcello Perathoner <marcello <at> perathoner.de>

Date: Sun, 28 Feb 2016 18:13:01 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.


From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: Marcello Perathoner <marcello <at> perathoner.de>
Subject: bug#22838: closed (Re: bug#22838: New 'Binary file' detection
 considered harmful)
Date: Fri, 09 Sep 2016 01:44:02 +0000
[Message part 1 (text/plain, inline)]
Your bug report

#22838: New 'Binary file' detection considered harmful

which was filed against the grep package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 22838 <at> debbugs.gnu.org.

-- 
22838: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=22838
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Marcello Perathoner <marcello <at> perathoner.de>,
 Eric Blake <eblake <at> redhat.com>, 22838-done <at> debbugs.gnu.org
Cc: Hans Pelleboer <hanspelleboer <at> online.nl>,
 Bruce Dubbs <bruce.dubbs <at> gmail.com>, Jim Meyering <jim <at> meyering.net>
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
Date: Thu, 8 Sep 2016 18:43:43 -0700
[Message part 3 (text/plain, inline)]
Paul Eggert wrote:
> On 03/01/2016 02:05 AM, Marcello Perathoner wrote:
>> 2) If you just output
>>
>>    binary line 42 in file x matches
>>
>> and continue regular output after the next newline, the breakage would be much
>> more confined.
>
> This sounds like a good suggestion.  That is, grep could keep going if its only
> problem is an attempt to output encoding errors (as opposed to reading null
> bytes, which are a more-reliable indication of binary data).  It would probably
> be better to output just one "Binary file matches" line per file, at the end of
> the other matches, so that it's more likely to be noticed.

I finally got around to implementing this, which turned out to be considerably 
easier than I thought it would be. I installed the attached patch into the grep 
Savannah master. I am boldly closing this old bug report; we can always start a 
new report if further problems turn up.
[0001-grep-encoding-errors-suppress-just-their-line.patch (text/x-diff, attachment)]
[Message part 5 (message/rfc822, inline)]
From: Marcello Perathoner <marcello <at> perathoner.de>
To: bug-grep <at> gnu.org
Subject: New 'Binary file' detection considered harmful
Date: Sun, 28 Feb 2016 12:17:07 +0100
The new heuristic for detecting 'Binary files' should be reverted to the 
old one (before 2.20), as the new one has too great a potential to make 
important tasks fail silently.


One of the most important use cases of grep is processing file lists,
e.g. in the pipe: find | grep | tar.  This is often done by backup 
software, e.g. in the Debian package 'backup2l'.

The new behaviour of grep -- to output 'Binary file matches' after 
output started -- has silently broken the 'backup2l' script and has the 
potential of silently breaking many other backup scripts as well.


Test case:

$ find /etc/ssl/certs/ | LANG= grep pem

Outcome:

grep will stop with 'Binary file (standard input) matches' after 
outputting a small percentage of the existing .pem files.

Expected behaviour:

grep should list all .pem files.
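
The test case above depends on what happens to be in /etc/ssl/certs. A minimal, self-contained sketch of the same check follows. It assumes a synthetic three-name list (the names and the make_list helper are made up for illustration) and compares how many names survive a plain grep versus grep -a. The plain count depends on the grep version and locale, so no particular value is claimed for it.

```shell
# Synthetic stand-in for `find` output: three names, one containing the
# byte 0xFF, which is not valid UTF-8 (names are hypothetical).
make_list() {
    printf 'certs/a.pem\ncerts/\377b.pem\ncerts/c.pem\n'
}

# Count the names that survive a plain grep, as in the test case above.
# The trailing `grep -a -c` counts only real names, not any
# 'Binary file ... matches' notice that grep may print.
plain=$(make_list | LANG= grep pem 2>/dev/null | grep -a -c '\.pem$')

# The same filter with -a, which makes grep treat the input as text.
with_a=$(make_list | LANG= grep -a pem | grep -a -c '\.pem$')

echo "plain grep kept $plain of 3 names; grep -a kept $with_a of 3"
```

If the two counts differ, the installed grep is dropping names from the list in the way described in this report.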


This behaviour is particularly insidious because users may not notice 
that their backup archives are a bit smaller than before or that their 
backups complete a bit faster, while many thousand files may be missing.



Q: Why do you use LANG= ?

A: To illustrate the problem and because 'backup2l' does that.

Q: Why don't people use the -a switch?

A: People may not notice anything wrong with their backups until they 
need them.

Q: Why don't you file a bug against 'backup2l'?

A: I will. But this is such a common use case that I suspect that many 
of the backup scripts that people wrote just for themselves are now broken.

Q: Why don't you just set the correct locale?

A: Even then, a single filename with a bogus encoding somewhere is enough 
to break your whole backup. It is easy to pick up such a file from the 
internet or from song or picture metadata.
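
For scripts that cannot rely on every filename being validly encoded, one defensive option is to pass -a and to use NUL-terminated records. The sketch below is an illustration only, not what 'backup2l' does. It assumes a grep that supports the -z (--null-data) option, as GNU grep does, and the filter_names helper and sample names are hypothetical.

```shell
# Hypothetical helper: keep only the NUL-terminated names matching a pattern.
filter_names() {
    # -a: treat the input as text even if it contains encoding errors
    # -z: records end in NUL instead of newline
    LC_ALL=C grep -a -z -- "$1"
}

# Synthetic NUL-delimited list; the second name contains the byte 0xFF.
kept=$(printf 'a.pem\000\377b.pem\000c.txt\000' \
    | filter_names '\.pem$' \
    | tr '\000' '\n' \
    | wc -l \
    | tr -d ' ')

echo "names kept: $kept"
```

In a real pipeline the list would come from find with -print0, and a consumer that accepts NUL-terminated names (for example GNU tar with --null -T -) would read the filtered output.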



Regards

-- 
Marcello Perathoner




This bug report was last modified 8 years and 256 days ago.


