GNU bug report logs - #22838
New 'Binary file' detection considered harmful


Package: grep

Reported by: Marcello Perathoner <marcello <at> perathoner.de>

Date: Sun, 28 Feb 2016 18:13:01 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.



Message #8 received at 22838 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Marcello Perathoner <marcello <at> perathoner.de>, 22838 <at> debbugs.gnu.org
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
Date: Sun, 28 Feb 2016 14:13:43 -0800

Marcello Perathoner wrote:

> The new behaviour of grep -- to output 'Binary file matches' after output
> started

I assume that the "new behavior" you're talking about is for grep 2.21 
(2014-11-23) and later, as that's the version of grep that started outputting 
"Binary file matches" due to input encoding errors. For example, on my platform 
(Ubuntu 15.10), the shell command:

LC_ALL=C awk 'BEGIN {for(i=1; i<256; i++) printf "%c %d\n", i, i}' |
LC_ALL=en_US.utf8 grep 126

outputs "Binary file (standard input) matches" in grep 2.21.

These changes were made partly for security reasons: not only to protect
grep's own internals (the old 'grep' would sometimes dump core when given
encoding errors), but also to protect invokers that expect properly
encoded text.

To some extent we were stuck between a rock and a hard place here. No matter 
what 'grep' does, it will do the wrong thing for some usages. But overall we 
thought it better for grep's output to be valid text.

I think you can work around the problem for unfixed backup2l by setting your
system's locale to a unibyte locale where all bytes are valid, such as
en_US.ISO-8859-15.
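
A rough sketch of the same idea applied to a single command rather than the whole system. It assumes GNU grep and that the en_US.ISO-8859-15 locale is installed (check with "locale -a"); if it is missing, the C library typically falls back to the C locale. The sample input is made up: one line containing byte 0xE9, which is not valid UTF-8 on its own. The -a option is shown as an alternative that does not depend on which locales are installed.

```shell
# Made-up sample input: the octal escape \351 is byte 0xE9 (e-acute in
# Latin-1), which is an encoding error in a UTF-8 locale.
sample() { printf 'caf\351 126\n'; }

# Workaround 1: run grep in a unibyte locale where every byte is a valid
# character. Assumes en_US.ISO-8859-15 exists on this system.
# cat -v only makes the high byte printable for display.
sample | LC_ALL=en_US.ISO-8859-15 grep 126 | cat -v

# Workaround 2: -a tells grep to treat the input as text regardless of
# encoding errors; -c counts the matching lines.
matches=$(sample | LC_ALL=C grep -a -c 126)
echo "lines matched with -a: $matches"
```

Either way the caller still receives bytes that are not valid UTF-8, so this only hides the symptom in a script that assumes well-formed text.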

Of course backup2l should get fixed, regardless of what we do with 'grep' or 
with your system locale.

> $ find /etc/ssl/certs/ | LANG= grep pem

Wouldn't the following be better?

find /etc/ssl/certs/ -name '*.pem'

This avoids false matches like '/etc/ssl/certs/pemmican'.  Alternatively:

find /etc/ssl/certs/ -print | grep -a '\.pem$'

> It is easy to catch such a file from the internet or from song or picture metadata.

None of the above approaches will work for arbitrary file names ("off the 
Internet"), because they all mishandle file names containing newlines. backup2l 
needs to do something like this:

find /etc/ssl/certs/ -name '*.pem' -print0

or like this:

find /etc/ssl/certs/ -print0 | grep -az '\.pem$'

with remaining code using null bytes instead of newlines to terminate file 
names. This is the sort of thing that backup2l should be doing.
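
To illustrate the difference, here is a small sketch that compares the newline-terminated and null-terminated pipelines on a scratch directory. The directory and file names are hypothetical, chosen only for the demonstration, and GNU find and GNU grep are assumed (for -print0 and -z).

```shell
# Scratch directory with made-up names: two real .pem files (one of them
# with a newline embedded in its name) and one non-.pem decoy.
dir=$(mktemp -d)
odd=$(printf 'odd.pem\nname.pem')   # a single file name containing a newline
touch "$dir/a.pem" "$dir/pemmican" "$dir/$odd"

# Newline-terminated: the odd name is split into two "lines", and both
# happen to end in .pem, so it is counted twice.
line_count=$(find "$dir" -print | grep -a -c '\.pem$')

# Null-terminated: each file name is exactly one record, newline included.
nul_count=$(find "$dir" -print0 | grep -a -z -c '\.pem$')

echo "newline-terminated matches: $line_count"
echo "null-terminated matches:    $nul_count"

rm -rf "$dir"
```

With two real .pem files present, the newline-terminated pipeline reports three matches while the null-terminated one reports two. Any consumer further down the pipeline would likewise need to read null-terminated records, for example with xargs -0.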




This bug report was last modified 8 years and 256 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.