GNU bug report logs - #22838
New 'Binary file' detection considered harmful


Package: grep

Reported by: Marcello Perathoner <marcello <at> perathoner.de>

Date: Sun, 28 Feb 2016 18:13:01 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.


From: Marcello Perathoner <marcello <at> perathoner.de>
To: 22838 <at> debbugs.gnu.org
Subject: bug#22838: New 'Binary file' detection considered harmful
Date: Sun, 28 Feb 2016 12:17:07 +0100
The new heuristic for detecting 'Binary files' should be reverted to the
old one (before 2.20), as the new one has too great a potential to make
important tasks fail silently.


One of the most important use cases of grep is processing file lists,
e.g. in the pipeline: find | grep | tar.  This is often done by backup
software, e.g. in the Debian package 'backup2l'.

The new behaviour of grep -- to output 'Binary file matches' after 
output started -- has silently broken the 'backup2l' script and has the 
potential of silently breaking many other backup scripts as well.


Test case:

$ find /etc/ssl/certs/ | LANG= grep pem

Outcome:

grep will stop with 'Binary file (standard input) matches' after 
outputting a small percentage of the existing .pem files.

Expected behaviour:

grep should list all .pem files.
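
The test case above depends on the contents of /etc/ssl/certs. A minimal self-contained sketch of the same comparison follows. The helper names make_list and count_matches and the three sample file names are hypothetical, and LC_ALL=C is used here instead of LANG= so that the locale is forced even when LC_ALL is already set. Whether the run without -a loses any lines depends on the grep version and locale, so no result is claimed for it; with -a (--text), grep treats the input as text and should print every matching line.

```shell
# Hypothetical stand-in for the output of find: three names, the second
# containing the byte 0xFF, which is not valid UTF-8.
make_list() {
    printf 'one.pem\ntwo\377.pem\nthree.pem\n'
}

# Count how many file names grep actually passes on.
# Extra arguments (such as -a) are handed to the first grep; the second
# grep only counts lines ending in .pem, so a 'Binary file ... matches'
# message on standard output is not counted as a file name.
count_matches() {
    make_list | LC_ALL=C grep "$@" pem | LC_ALL=C grep -a -c '\.pem$'
}

plain=$(count_matches)
text=$(count_matches -a)
echo "without -a: $plain of 3 names, with -a: $text of 3 names"
```

If the first number is smaller than the second, the pipeline is silently dropping file names in the way described in this report.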


This behaviour is particularly insidious because users may not notice
that their backup archives are a bit smaller than before or that their
backups complete a bit faster, while many thousands of files may be missing.



Q: Why do you use LANG= ?

A: To illustrate the problem and because 'backup2l' does that.

Q: Why don't people use the -a switch?

A: People may not notice anything wrong with their backups until they 
need them.

Q: Why don't you file a bug against 'backup2l'?

A: I will. But this is such a common use case that I suspect that many 
of the backup scripts that people wrote just for themselves are now broken.

Q: Why don't you just set the correct locale?

A: Even then it suffices to have one bogus-encoded filename somewhere to 
break your whole backup. It is easy to catch such a file from the 
internet or from song or picture metadata.
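
To check whether a tree already contains such a name, something along these lines might help. It is only a sketch: it assumes the names are meant to be UTF-8, relies on iconv exiting with a non-zero status on an invalid sequence, and the function names is_valid_utf8 and list_bogus_names are made up for illustration.

```shell
# Succeeds if the argument is a valid UTF-8 byte sequence.
# iconv exits with a non-zero status when it meets an invalid sequence.
is_valid_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Print every name below the given directory that is not valid UTF-8.
# Names containing newlines are not handled by this simple loop.
list_bogus_names() {
    find "$1" | while IFS= read -r name; do
        is_valid_utf8 "$name" || printf '%s\n' "$name"
    done
}
```

For example, list_bogus_names /etc/ssl/certs would print any names under that directory that fail the check.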



Regards

-- 
Marcello Perathoner





This bug report was last modified 8 years and 256 days ago.


