GNU bug report logs - #22838
New 'Binary file' detection considered harmful

Package: grep

Reported by: Marcello Perathoner <marcello <at> perathoner.de>

Date: Sun, 28 Feb 2016 18:13:01 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: tracker <at> debbugs.gnu.org
Subject: bug#22838: closed (New 'Binary file' detection considered harmful)
Date: Fri, 09 Sep 2016 01:44:02 +0000
[Message part 1 (text/plain, inline)]
Your message dated Thu, 8 Sep 2016 18:43:43 -0700
with message-id <3fa28b6a-9a78-375a-5978-46987a9bb681 <at> cs.ucla.edu>
and subject line Re: bug#22838: New 'Binary file' detection considered harmful
has caused the debbugs.gnu.org bug report #22838,
regarding New 'Binary file' detection considered harmful
to be marked as done.

(If you believe you have received this mail in error, please contact
help-debbugs <at> gnu.org.)


-- 
22838: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=22838
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
From: Marcello Perathoner <marcello <at> perathoner.de>
To: bug-grep <at> gnu.org
Subject: New 'Binary file' detection considered harmful
Date: Sun, 28 Feb 2016 12:17:07 +0100
The new heuristics for detecting 'Binary files' should be reverted to the
old ones (before 2.20), because the new ones have too much potential to
make important tasks fail silently.


One of the most important use cases of grep is processing file lists,
e.g. in the pipeline: find | grep | tar.  This is often done by backup
software, e.g. in the Debian package 'backup2l'.

The new behaviour of grep -- to output 'Binary file matches' after
output has already started -- has silently broken the 'backup2l' script
and has the potential to silently break many other backup scripts as well.


Test case:

$ find /etc/ssl/certs/ | LANG= grep pem

Outcome:

grep will stop with 'Binary file (standard input) matches' after 
outputting a small percentage of the existing .pem files.

Expected behaviour:

grep should list all .pem files.


This behaviour is particularly insidious because users may not notice
that their backup archives are a bit smaller than before or that their
backups complete a bit faster, while many thousands of files may be missing.



Q: Why do you use LANG= ?

A: To illustrate the problem and because 'backup2l' does that.

Q: Why don't people use the -a switch?

A: People may not notice anything wrong with their backups until they 
need them.

Q: Why don't you file a bug against 'backup2l'?

A: I will. But this is such a common use case that I suspect that many 
of the backup scripts that people wrote just for themselves are now broken.

Q: Why don't you just set the correct locale?

A: Even then, a single filename with a bogus encoding somewhere is enough
to break your whole backup. It is easy to pick up such a file from the
internet or from song or picture metadata.
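
To make the last answer concrete, here is a minimal Python sketch that scans
a file list for names that are not valid in a given encoding. It is only an
illustration and is not part of grep or backup2l; the helper name
`undecodable_names`, its parameters, and the sample names are hypothetical.

```python
from typing import Iterable, List


def undecodable_names(names: Iterable[bytes],
                      encoding: str = "utf-8") -> List[bytes]:
    """Return the names that cannot be decoded in `encoding` (hypothetical helper)."""
    bad = []
    for name in names:
        try:
            name.decode(encoding)
        except UnicodeDecodeError:
            # A single entry like this is enough to make a filter that
            # expects valid text treat the whole list as binary.
            bad.append(name)
    return bad


# Sample list (made up): one Latin-1 encoded name among UTF-8 names.
names = [b"song.mp3", b"caf\xe9.jpg", "café.jpg".encode("utf-8")]
print(undecodable_names(names))
```

Only the Latin-1 encoded name is reported here; the same name encoded as
UTF-8 passes.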



Regards

-- 
Marcello Perathoner



[Message part 3 (message/rfc822, inline)]
From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Marcello Perathoner <marcello <at> perathoner.de>,
 Eric Blake <eblake <at> redhat.com>, 22838-done <at> debbugs.gnu.org
Cc: Hans Pelleboer <hanspelleboer <at> online.nl>,
 Bruce Dubbs <bruce.dubbs <at> gmail.com>, Jim Meyering <jim <at> meyering.net>
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
Date: Thu, 8 Sep 2016 18:43:43 -0700
[Message part 4 (text/plain, inline)]
Paul Eggert wrote:
> On 03/01/2016 02:05 AM, Marcello Perathoner wrote:
>> 2) If you just output
>>
>>    binary line 42 in file x matches
>>
>> and continue regular output after the next newline, the breakage would be much
>> more confined.
>
> This sounds like a good suggestion.  That is, grep could keep going if its only
> problem is an attempt to output encoding errors (as opposed to reading null
> bytes, which are a more-reliable indication of binary data).  It would probably
> be better to output just one "Binary file matches" line per file, at the end of
> the other matches, so that it's more likely to be noticed.

I finally got around to implementing this, which turned out to be considerably 
easier than I thought it would be. I installed the attached patch into the grep 
Savannah master. I am boldly closing this old bug report; we can always start a 
new report if further problems turn up.
[0001-grep-encoding-errors-suppress-just-their-line.patch (text/x-diff, attachment)]
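
The behaviour discussed in the quoted exchange above (suppress only the
matching lines that contain encoding errors, keep going, and print a single
'Binary file matches' notice after the other matches) can be sketched as
follows. This is a simplified Python illustration, not the attached patch:
the function name `grep_lines`, its parameters, and the sample file list are
hypothetical, and the handling of null bytes is left out entirely.

```python
import re
from typing import Iterable, List


def grep_lines(lines: Iterable[bytes], pattern: bytes,
               encoding: str = "ascii",
               name: str = "(standard input)") -> List[str]:
    """Return matching lines, hiding only those that fail to decode."""
    regex = re.compile(pattern)
    output: List[str] = []
    suppressed = False
    for raw in lines:
        line = raw.rstrip(b"\n")
        if not regex.search(line):
            continue
        try:
            output.append(line.decode(encoding))
        except UnicodeDecodeError:
            # Encoding error: suppress just this line and keep going.
            suppressed = True
    if suppressed:
        # One notice per file, placed after the ordinary matches.
        output.append(f"Binary file {name} matches")
    return output


# Sample file list (made up) with one badly encoded name.
file_list = [b"/etc/ssl/certs/a.pem\n",
             b"/etc/ssl/certs/b\xff.pem\n",
             b"/etc/ssl/certs/readme.txt\n",
             b"/etc/ssl/certs/c.pem\n"]
for entry in grep_lines(file_list, rb"pem"):
    print(entry)
```

With this sample list, the two well-formed .pem names are kept and one
notice follows them, so the breakage stays confined to the offending line.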

This bug report was last modified 8 years and 257 days ago.
