GNU bug report logs - #22838
New 'Binary file' detection considered harmful

Package: grep

Reported by: Marcello Perathoner <marcello <at> perathoner.de>

Date: Sun, 28 Feb 2016 18:13:01 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Report forwarded to bug-grep <at> gnu.org:
bug#22838; Package grep. (Sun, 28 Feb 2016 18:13:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Marcello Perathoner <marcello <at> perathoner.de>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sun, 28 Feb 2016 18:13:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Marcello Perathoner <marcello <at> perathoner.de>
To: bug-grep <at> gnu.org
Subject: New 'Binary file' detection considered harmful
Date: Sun, 28 Feb 2016 12:17:07 +0100

The new heuristic for detecting 'Binary files' should be reverted to the
old one (before 2.20), as the new one has too much potential to make
important tasks fail silently.


One of the most important use cases of grep is processing file lists,
e.g. in the pipe: find | grep | tar.  This is often done by backup
software, e.g. in the Debian package 'backup2l'.

The new behaviour of grep -- to output 'Binary file matches' after 
output started -- has silently broken the 'backup2l' script and has the 
potential of silently breaking many other backup scripts as well.


Test case:

$ find /etc/ssl/certs/ | LANG= grep pem

Outcome:

grep will stop with 'Binary file (standard input) matches' after 
outputting a small percentage of the existing .pem files.

Expected behaviour:

grep should list all .pem files.


This behaviour is particularly insidious because users may not notice 
that their backup archives are a bit smaller than before or that their 
backups complete a bit faster, while many thousand files may be missing.



Q: Why do you use LANG= ?

A: To illustrate the problem and because 'backup2l' does that.

Q: Why don't people use the -a switch?

A: People may not notice anything wrong with their backups until they 
need them.

Q: Why don't you file a bug against 'backup2l'?

A: I will. But this is such a common use case that I suspect that many 
of the backup scripts that people wrote just for themselves are now broken.

Q: Why don't you just set the correct locale?

A: Even then it suffices to have one bogus-encoded filename somewhere to 
break your whole backup. It is easy to catch such a file from the 
internet or from song or picture metadata.



Regards

-- 
Marcello Perathoner

Information forwarded to bug-grep <at> gnu.org:
bug#22838; Package grep. (Sun, 28 Feb 2016 22:14:01 GMT) Full text and rfc822 format available.

Message #8 received at 22838 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Marcello Perathoner <marcello <at> perathoner.de>, 22838 <at> debbugs.gnu.org
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
Date: Sun, 28 Feb 2016 14:13:43 -0800

Marcello Perathoner wrote:

> The new behaviour of grep -- to output 'Binary file matches' after output
> started

I assume that the "new behavior" you're talking about is for grep 2.21 
(2014-11-23) and later, as that's the version of grep that started outputting 
"Binary file matches" due to input encoding errors. For example, on my platform 
(Ubuntu 15.10), the shell command:

LC_ALL=C awk 'BEGIN {for(i=1; i<256; i++) printf "%c %d\n", i, i}' |
LC_ALL=en_US.utf8 grep 126

outputs "Binary file (standard input) matches" in grep 2.21.

These changes were put in partly due to security issues, not only having to do 
with grep's internals (the old 'grep' would dump core sometimes when given 
encoding errors), but also for the benefit of invokers expecting properly 
encoded text.

To some extent we were stuck between a rock and a hard place here. No matter 
what 'grep' does, it will do the wrong thing for some usages. But overall we 
thought it better for grep's output to be valid text.

I think you can work around the problem for unfixed backup2l by setting your 
system's locale to a unibyte locale where all bytes are valid. The 
en_US.ISO-8859-15 locale, say.

Of course backup2l should get fixed, regardless of what we do with 'grep' or 
with your system locale.

> $ find /etc/ssl/certs/ | LANG= grep pem

Wouldn't the following be better?

find /etc/ssl/certs/ -name '*.pem'

This avoids false matches like '/etc/ssl/certs/pemmican'.  Alternatively:

find /etc/ssl/certs/ -print | grep -a '\.pem$'

> It is easy to catch such a file from the internet or from song or picture metadata.

None of the above approaches will work for arbitrary file names ("off the 
Internet"), because they all mishandle file names containing newlines. backup2l 
needs to do something like this:

find /etc/ssl/certs/ -name '*.pem' -print0

or like this:

find /etc/ssl/certs/ -print0 | grep -az '\.pem$'

with remaining code using null bytes instead of newlines to terminate file 
names. This is the sort of thing that backup2l should be doing.
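
A minimal sketch of the null-terminated pipeline described above, assuming
GNU find, grep and tar (for -print0, -z, --null and -T -). The function
names are hypothetical and are not taken from backup2l:

```sh
# Sketch only: list_pem_files and archive_pem_files are hypothetical helpers.
# GNU find, grep and tar are assumed.

# Print every regular file under directory $1 whose name ends in ".pem",
# one NUL-terminated name per record. -a keeps names with encoding errors.
list_pem_files() {
    find "$1" -type f -print0 | grep -az '\.pem$'
}

# Archive those files into $2; tar reads the NUL-terminated list from stdin.
archive_pem_files() {
    list_pem_files "$1" | tar --null -T - -cf "$2"
}
```

For example, archive_pem_files /etc/ssl/certs /tmp/certs.tar would archive
the matching files. Because every stage uses NUL as the terminator, names
containing newlines or invalid byte sequences are passed through unchanged.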

Message #11 received at 22838 <at> debbugs.gnu.org (full text, mbox):

From: Marcello Perathoner <marcello <at> perathoner.de>
To: Paul Eggert <eggert <at> cs.ucla.edu>, 22838 <at> debbugs.gnu.org
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
Date: Mon, 29 Feb 2016 18:14:04 +0100

On 02/28/2016 11:13 PM, Paul Eggert wrote:

> These changes were put in partly due to security issues, not only having
> to do with grep's internals (the old 'grep' would dump core sometimes
> when given encoding errors), but also for the benefit of invokers
> expecting properly encoded text.
>
> To some extent we were stuck between a rock and a hard place here. No
> matter what 'grep' does, it will do the wrong thing for some usages. But
> overall we thought it better for grep's output to be valid text.

You are driving out demons by Beelzebub.

grep is a core component of every unix system. You cannot change the
behaviour or interface of such a fundamental tool without incurring
substantial breakage. Keeping the old bug is far wiser than to fix it
and introduce a new bug.

Copying faulty input to the output is a preferable failure mode to 
dropping part of the expected output. People do not expect grep to 
validate their input but they do expect grep to produce a complete 
result set.

A text file with encoding problems is a text file and not a binary file.

>> $ find /etc/ssl/certs/ | LANG= grep pem
>
> Wouldn't the following be better?
>
> find /etc/ssl/certs/ -name '*.pem'

I'm not doing that. That was just an example to show how grep now gives 
incorrect results.

Many more cases can be made: any process that feeds tainted 
(user-provided) strings to grep can now be made to fail. Eg. a process 
that greps apache logs for known exploit signatures will now fail if the 
attacker sends a bogus user-agent string.

Regards

-- 
Marcello Perathoner

Message #14 received at 22838 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Marcello Perathoner <marcello <at> perathoner.de>,
 Paul Eggert <eggert <at> cs.ucla.edu>, 22838 <at> debbugs.gnu.org
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
Date: Mon, 29 Feb 2016 10:22:22 -0700

On 02/29/2016 10:14 AM, Marcello Perathoner wrote:
> 
> A text file with encoding problems is a text file and not a binary file.

Wrong, at least according to the POSIX definition of text file.  A text
file is one with no encoding errors.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Message #17 received at 22838 <at> debbugs.gnu.org (full text, mbox):

From: Marcello Perathoner <marcello <at> perathoner.de>
To: Eric Blake <eblake <at> redhat.com>, Paul Eggert <eggert <at> cs.ucla.edu>,
 22838 <at> debbugs.gnu.org
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
Date: Mon, 29 Feb 2016 18:40:40 +0100

On 02/29/2016 06:22 PM, Eric Blake wrote:
> On 02/29/2016 10:14 AM, Marcello Perathoner wrote:
>>
>> A text file with encoding problems is a text file and not a binary file.
>
> Wrong, at least according to the POSIX definition of text file.  A text
> file is one with no encoding errors.

"""
3.397 Text File

A file that contains characters organized into zero or more lines. The 
lines do not contain NUL characters and none can exceed {LINE_MAX} bytes 
in length, including the <newline> character. Although POSIX.1-2008 does 
not distinguish between text files and binary files (see the ISO C 
standard), many utilities only produce predictable or meaningful output 
when operating on text files. The standard utilities that have such 
restrictions always specify "text files" in their STDIN or INPUT FILES 
sections.

"""

-- The Open Group Base Specifications Issue 7
IEEE Std 1003.1, 2013 Edition
Copyright © 2001-2013 The IEEE and The Open Group

Regards

-- 
Marcello Perathoner

Message #20 received at 22838 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Marcello Perathoner <marcello <at> perathoner.de>,
 Paul Eggert <eggert <at> cs.ucla.edu>, 22838 <at> debbugs.gnu.org
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
Date: Mon, 29 Feb 2016 10:54:52 -0700

On 02/29/2016 10:40 AM, Marcello Perathoner wrote:
>> Wrong, at least according to the POSIX definition of text file.  A text
>> file is one with no encoding errors.
> 
> 
> """
> 3.397 Text File
> 
> A file that contains characters organized into zero or more lines. The
> lines do not contain NUL characters and none can exceed {LINE_MAX} bytes
> in length, including the <newline> character. Although POSIX.1-2008 does
> not distinguish between text files and binary files (see the ISO C
> standard), many utilities only produce predictable or meaningful output
> when operating on text files. The standard utilities that have such
> restrictions always specify "text files" in their STDIN or INPUT FILES
> sections.

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html

> 
> 3.206 Line
> 
> A sequence of zero or more non- <newline> characters plus a terminating <newline> character.
> 
> 3.87 Character
> 
> A sequence of one or more bytes representing a single graphic symbol or control code.
> 
> Note:
> This term corresponds to the ISO C standard term multi-byte character, where a single-byte character is a special case of a multi-byte character. Unlike the usage in the ISO C standard, character here has no necessary relationship with storage space, and byte is used when storage space is discussed.
> 
> See the definition of the portable character set in Portable Character Set for a further explanation of the graphical representations of (abstract) characters, as opposed to character encodings.
> 

Encoding errors are not characters, but bytes.  A line cannot contain
encoding errors.  Therefore, a file with encoding errors is not a text file.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Message #23 received at 22838 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Marcello Perathoner <marcello <at> perathoner.de>,
 Paul Eggert <eggert <at> cs.ucla.edu>, 22838 <at> debbugs.gnu.org
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
Date: Mon, 29 Feb 2016 10:56:14 -0700

On 02/29/2016 10:54 AM, Eric Blake wrote:
> Encoding errors are not characters, but bytes.  A line cannot contain
> encoding errors.  Therefore, a file with encoding errors is not a text file.

Corollary - there exist files which are text files in some locales, but
binary files in others (based on whether the locale interprets the bytes
as an encoding error or as valid characters).

Yes, locale dependencies on standard behavior can be annoying.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Message #26 received at 22838 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Marcello Perathoner <marcello <at> perathoner.de>, 22838 <at> debbugs.gnu.org
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
Date: Mon, 29 Feb 2016 11:29:24 -0800

On 02/29/2016 09:14 AM, Marcello Perathoner wrote:
> Keeping the old bug is far wiser than to fix it and introduce a new bug.

That depends on the bugs in question. The old bugs were pretty bad.

> Copying faulty input to the output is a preferable failure mode

Again, we cannot satisfy everybody. There are reasonable complaints from 
users if 'grep' blasts improperly-encoded data to their terminals, or 
more generally if grep's improperly-encoded output trashes other 
programs that read the output.  This is why grep has the -a option.  It 
sounds like you need grep's -a option for your application, and it 
should be easy to use -a.  It's not clear that -a should be the default.

> any process that feeds tainted (user-provided) strings to grep can now 
> be made to fail. Eg. a process that greps apache logs for known 
> exploit signatures will now fail if the attacker sends a bogus 
> user-agent string.

Such a process won't fail if it uses grep's -a option, or if it treats 
the "Binary file matches" diagnostic as an indication that there are 
possible attacks, or if it is run in a unibyte locale where all bytes 
are valid characters, or if it looks at grep's exit status. Granted, 
slapdash approaches that don't do any of these things will be 
vulnerable, but they'll be vulnerable even with older grep versions.
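
As a sketch of two of the remedies listed above (the -a option combined
with checking grep's exit status), assuming a POSIX sh and a grep that
supports -a. The helper name log_has_signature is hypothetical:

```sh
# Hypothetical helper: exit status 0 if file $1 contains a line matching the
# basic regular expression $2, 1 if it does not. -a treats the file as text
# even if it has encoding errors; -q suppresses all output, so there is no
# "Binary file matches" line to misread.
log_has_signature() {
    grep -a -q -e "$2" -- "$1"
}
```

A caller would test the status rather than parse the output, e.g.
"if log_has_signature access.log 'exploit\.sh'; then ...; fi", so a bogus
user-agent string elsewhere in the log does not hide the match.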

Message #29 received at 22838 <at> debbugs.gnu.org (full text, mbox):

From: Marcello Perathoner <marcello <at> perathoner.de>
To: Eric Blake <eblake <at> redhat.com>, Paul Eggert <eggert <at> cs.ucla.edu>,
 22838 <at> debbugs.gnu.org
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
Date: Mon, 29 Feb 2016 21:11:02 +0100

On 02/29/2016 06:56 PM, Eric Blake wrote:
> On 02/29/2016 10:54 AM, Eric Blake wrote:
>> Encoding errors are not characters, but bytes.  A line cannot contain
>> encoding errors.  Therefore, a file with encoding errors is not a text file.
>
> Corollary - there exist files which are text files in some locales, but
> binary files in others (based on whether the locale interprets the bytes
> as an encoding error or as valid characters).
>
> Yes, locale dependencies on standard behavior can be annoying.
>

You assume that a user will only ever want to grep text files encoded in 
the machine's locale. That is not so.

As a German user I have on my disk files in many encodings: utf-8, 
iso-8859-1, win-1252, iso-8859-15, encodings that are now defunct like 
CP850, CP847, "German 7-bit ASCII" that replaced braces with Umlauts, 
old WordStar files that used control characters inside.

Since 2.21 I will now have to always specify -a or LC_ALL=C when 
grepping my files.

Regards

-- 
Marcello Perathoner

Message #32 received at 22838 <at> debbugs.gnu.org (full text, mbox):

From: Marcello Perathoner <marcello <at> perathoner.de>
To: Paul Eggert <eggert <at> cs.ucla.edu>, 22838 <at> debbugs.gnu.org
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
Date: Mon, 29 Feb 2016 21:34:46 +0100

On 02/29/2016 08:29 PM, Paul Eggert wrote:
> On 02/29/2016 09:14 AM, Marcello Perathoner wrote:
>> Keeping the old bug is far wiser than to fix it and introduce a new bug.
>
> That depends on the bugs in question. The old bugs were pretty bad.
>
>> Copying faulty input to the output is a preferable failure mode
>
> Again, we cannot satisfy everybody. There are reasonable complaints from
> users if 'grep' blasts improperly-encoded data to their terminals, or
> more generally if grep's improperly-encoded output trashes other
> programs that read the output.

They would 'blast' their terminals without grep too. I don't see any 
grounds for a complaint like that. Grep is not a sanitizer.

>  This is why grep has the -a option.  It
> sounds like you need grep's -a option for your application, and it
> should be easy to use -a.  It's not clear that -a should be the default.

I was lucky in that I noticed that a 17GB tar file could not be a 
complete backup of a 500GB drive. I was lucky because the now offending 
filename (the same filename that didn't bother grep for over 10 years) 
was early in the file list. If it had been late in the file list I 
wouldn't have noticed that a 400GB tar file was missing a few thousand 
files.

Other people may not be that lucky and they could get understandably 
angry at losing their data.

At least, if you must turn grep into a text file sanitizer, make the new
behaviour optional. You can then tell people who complain about
'blasted' terminals to turn on that option, while other people would not
blindly run into the new bug.

Regards

-- 
Marcello Perathoner

Message #35 received at submit <at> debbugs.gnu.org (full text, mbox):

From: "Holger Bruenjes" <holgerbruenjes <at> gmx.net>
To: bug-grep <at> gnu.org
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
Date: Mon, 29 Feb 2016 21:29:46 +0100

Am 2016-02-29 um 21:11 schrieb Marcello Perathoner:

> As a German user I have on my disk files in many encodings: utf-8, 
> iso-8859-1, win-1252, iso-8859-15, encodings that are now defunct like 
> CP850, CP847, "German 7-bit ASCII" that replaced braces with Umlauts, 
> old WordStar files that used control characters inside.
> 
> Since 2.21 I will now have to always specify -a or LC_ALL=C when 
> grepping my files.

You can use a wrapper for grep:

mv grep in.grep

and create a new grep file with the following:

#!/bin/sh
LC_ALL=C exec /usr/bin/in.grep "$@"  # assignment prefix puts LC_ALL in grep's environment

That worked perfectly for me.

Holger

Message #38 received at 22838 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Marcello Perathoner <marcello <at> perathoner.de>,
 Paul Eggert <eggert <at> cs.ucla.edu>, 22838 <at> debbugs.gnu.org
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
Date: Mon, 29 Feb 2016 15:37:55 -0700

On 02/29/2016 01:11 PM, Marcello Perathoner wrote:

>> Yes, locale dependencies on standard behavior can be annoying.
>>
> 
> You assume that a user will only ever want to grep text files encoded in
> the machine's locale. That is not so.

You've been relying on undefined behavior, and it caught up with you.
It's the same as asking for us to keep use-after-free "working" in a
multithreaded program because it has always "worked" in your older
single-threaded program when nothing was perturbing the memory between
free() and its latent use.  A latent bug in your usage is still a bug in
your usage, even if it took a change in grep's defaults to expose your
problem.

And meanwhile, newer grep 2.23 has improved the heuristics to only
complain about a binary file if it would otherwise be outputting
encoding errors (rather than blindly complaining about the encoding
error up front and stopping processing immediately), which does
alleviate some of the worst of the change caused by your undefined usage
(that is, you can still grep for valid encodings, and get reasonable
results so long as the valid text doesn't mix with lines with invalid
encodings).

> 
> As a German user I have on my disk files in many encodings: utf-8,
> iso-8859-1, win-1252, iso-8859-15, encodings that are now defunct like
> CP850, CP847, "German 7-bit ASCII" that replaced braces with Umlauts,
> old WordStar files that used control characters inside.
> 
> Since 2.21 I will now have to always specify -a or LC_ALL=C when
> grepping my files.

Yes, but then you are no longer relying on undefined behavior, and
therefore have a leg to stand on if we break that behavior.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Message #41 received at 22838 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Marcello Perathoner <marcello <at> perathoner.de>, 22838 <at> debbugs.gnu.org
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
Date: Mon, 29 Feb 2016 15:35:19 -0800

On 02/29/2016 12:34 PM, Marcello Perathoner wrote:
> On 02/29/2016 08:29 PM, Paul Eggert wrote:
>
> They would 'blast' their terminals without grep too.

Sure, but in practice it's common for users to do something like this:

grep -r getaddrinfo_a *

I just now did this in my working copy of the GNU Emacs source code. If 
-a were the default, I would see 13874778 bytes on my screen, the vast 
majority of which would be useless or even harmful. As grep stands now, 
I see just 5480 bytes and they're mostly useful.

> I was lucky in that I noticed that a 17GB tar file could not be a 
> complete backup of a 500GB drive.

Yes, you were lucky there. But you were unlucky in that your backup 
software invoked grep without worrying about file name validity. Suppose 
a file name contained a newline? Your backups could be toast.

> At least ... make the new behaviour optional.

It is optional; we merely disagree about the option's default value.

> Since 2.21 I will now have to always specify -a or LC_ALL=C when
> grepping my files.

I suggest using -a. LC_ALL=C won't work the way that you want on 
platforms where the C locale is UTF-8, or is pure ASCII. For example, on 
Fedora 23 or RHEL 7 with grep 2.23 we have:

$ printf '\200\n' | LC_ALL=C grep .
Binary file (standard input) matches

This is because the C locale is pure ASCII on these platforms, i.e., 
'\200' is not a valid character the way it is with traditional Unix.  I 
don't know why Red Hat made that change.
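
Whether LC_ALL=C is 8-bit clean on a given machine can be probed at run
time. A rough sketch, assuming a POSIX sh with printf, od and tr; the
function name is hypothetical, and the result depends on the platform's C
library and grep version:

```sh
# Hypothetical probe: exit status 0 if grep, run with LC_ALL=C, copies a line
# consisting of byte 0x80 (octal 200) to stdout unchanged, 1 otherwise.
c_locale_passes_high_bytes() {
    # stderr is discarded in case this grep reports binary matches there.
    out=$(printf '\200\n' | LC_ALL=C grep . 2>/dev/null | od -An -to1 | tr -d ' \n')
    # Octal 200 followed by octal 012 (newline) means the line came through.
    [ "$out" = "200012" ]
}
```

A script could call this once and fall back to grep -a when it fails,
although using -a unconditionally, as suggested above, is simpler.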

Message #44 received at 22838 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>,
 Marcello Perathoner <marcello <at> perathoner.de>, 22838 <at> debbugs.gnu.org
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
Date: Mon, 29 Feb 2016 16:55:14 -0700

On 02/29/2016 04:35 PM, Paul Eggert wrote:

> I suggest using -a. LC_ALL=C won't work the way that you want on
> platforms where the C locale is UTF-8, or is pure ASCII. For example, on
> Fedora 23 or RHEL 7 with grep 2.23 we have:
> 
> $ printf '\200\n' | LC_ALL=C grep .
> Binary file (standard input) matches
> 
> This is because the C locale is pure ASCII on these platforms, i.e.,
> '\200' is not a valid character the way it is with traditional Unix.  I
> don't know why Red Hat made that change.

I _think_ the Austin Group is leaning towards requiring the "C" locale
to always be a unibyte locale with all 256 bytes as valid characters, so
neither strict 7-bit ASCII nor UTF-8 would be usable as the "C" locale;
but for that to happen, POSIX would also need to allow a way to get a
UTF-8 locale easily accessible and describe how it differs from the "C"
locale under such a ruling.  But it's still all conjecture on what the
final results will be - even in the standards committee, gracefully
documenting how locale corner cases must behave vs. leaving
implementations some latitude is tricky business; and any such change is
at least 3 or 4 years down the road before it could be standardized in
Issue 8 (right now, the focus is on Technical Corrigendum 2 for Issue 7).

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Message #47 received at 22838 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 22838 <at> debbugs.gnu.org, Marcello Perathoner <marcello <at> perathoner.de>
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
Date: Mon, 29 Feb 2016 17:23:45 -0800

On Mon, Feb 29, 2016 at 3:35 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 02/29/2016 12:34 PM, Marcello Perathoner wrote:
...
>> Since 2.21 I will now have to always specify -a or LC_ALL=C when
>> grepping my files.
>
> I suggest using -a. LC_ALL=C won't work the way that you want on platforms
> where the C locale is UTF-8, or is pure ASCII. For example, on Fedora 23 or
> RHEL 7 with grep 2.23 we have:
>
> $ printf '\200\n' | LC_ALL=C grep .
> Binary file (standard input) matches
>
> This is because the C locale is pure ASCII on these platforms, i.e., '\200'
> is not a valid character the way it is with traditional Unix.  I don't know
> why Red Hat made that change.

Wow. I hadn't noticed that using LC_ALL=C is inadequate.
Disturbing...

Message #50 received at 22838 <at> debbugs.gnu.org (full text, mbox):

From: Bruce Dubbs <bruce.dubbs <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>,
 Marcello Perathoner <marcello <at> perathoner.de>, 22838 <at> debbugs.gnu.org
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
Date: Mon, 29 Feb 2016 20:24:34 -0600

Paul Eggert wrote:

> $ printf '\200\n' | LC_ALL=C grep .
> Binary file (standard input) matches
>
> This is because the C locale is pure ASCII on these platforms, i.e.,
> '\200' is not a valid character the way it is with traditional Unix.  I
> don't know why Red Hat made that change.

I also get the 'Binary file (standard input) matches' output from the 
above string on a Linux From Scratch system.  We build everything in a 
fairly generic way and did nothing special in this area.  I suspect this 
is something buried deep into glibc.

  -- Bruce

Message #53 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Hans Pelleboer <hanspelleboer <at> online.nl>
To: bug-grep <at> gnu.org
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
Date: Tue, 1 Mar 2016 05:01:55 +0100

On 03/01/2016 12:55 AM, Eric Blake wrote:
> I _think_ the Austin Group is leaning towards requiring the "C" locale 
> to always be a unibyte locale with all 256 bytes as valid characters, 
> so neither strict 7-bit ASCII nor UTF-8 would be usable as the "C" 
> locale; but for that to happen, POSIX would also need to allow a way 
> to get a UTF-8 locale easily accessible and 

You do realize that this leaves all _non-US_ users, who rely on diacritics
or even entirely different character sets for their language, completely
out in the cold?

> describe how it differs from the "C" locale under such a ruling. But 
> it's still all conjecture on what the final results will be - even in 
> the standards committee, gracefully documenting how locale corner 
> cases must behave vs. leaving implementations some latitude is tricky 
> business; and any such change is at least 3 or 4 years down the road 
> before it could be standardized in Issue 8 (right now, the focus is on 
> Technical Corrigendum 2 for Issue 7). 

Back in _1987_, a professor of IT in Leiden was appointed specifically to
streamline all the competing character sets that were later merged into
Unicode. Given the current state of affairs, nearly thirty years down the
road, I do not share your optimism that this issue will be resolved in the
next couple of years.

Hans Pelleboer

Message #56 received at 22838 <at> debbugs.gnu.org (full text, mbox):

From: Marcello Perathoner <marcello <at> perathoner.de>
To: Eric Blake <eblake <at> redhat.com>, Paul Eggert <eggert <at> cs.ucla.edu>,
 22838 <at> debbugs.gnu.org
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
Date: Tue, 1 Mar 2016 11:05:21 +0100

On 02/29/2016 11:37 PM, Eric Blake wrote:
> On 02/29/2016 01:11 PM, Marcello Perathoner wrote:
>
>>> Yes, locale dependencies on standard behavior can be annoying.
>>>
>>
>> You assume that a user will only ever want to grep text files encoded in
>> the machine's locale. That is not so.
>
> You've been relying on undefined behavior, and it caught up with you.

(It is the backup2l author who has been relying on it. I'm just a user of
that package, and I have already filed a bug against backup2l too.)

You confuse 'undefined' with 'undocumented'.  The old behaviour was very 
well defined, even if it could turn out nasty.  It was defined by 
implementation: it was a de-facto standard.

OTOH it was nowhere documented that grepping non-locale files was 
considered marginal or illegal.

The old documentation explicitly stated:

"""
If the first few bytes of a file indicate that the file contains
binary data, assume that the file is of type TYPE. By default, TYPE is
binary, and grep normally outputs either a one-line message saying
that a binary file matches, or no message if there is no match.
"""
--- from an old man page

The new behaviour changes documented old behaviour.

Furthermore, there's no need to fix the old bug in such a heavy-handed
way. Less disruptive alternatives:

1) Make the new behaviour an opt-in.  Print a deprecation warning that 
gives people a chance to fix their scripts.  After a while make the new 
behaviour the default.

2) If you just output

   binary line 42 in file x matches

and continue regular output after the next newline, the breakage would 
be much more confined.

3) Fail in the old documented way of printing only the error message 
instead of introducing a new mode of failure that looks like success and 
loses the error message in the noise.

4) Don't implement this change between minor releases. A breaking change 
deserves a major release.

Regards

-- 
Marcello Perathoner

Information forwarded to bug-grep <at> gnu.org:
bug#22838; Package grep. (Tue, 01 Mar 2016 17:15:02 GMT) Full text and rfc822 format available.

Message #59 received at 22838 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Marcello Perathoner <marcello <at> perathoner.de>,
 Eric Blake <eblake <at> redhat.com>, 22838 <at> debbugs.gnu.org
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
Date: Tue, 1 Mar 2016 09:14:08 -0800

On 03/01/2016 02:05 AM, Marcello Perathoner wrote:
> 1) Make the new behaviour an opt-in.

Again, this is arguing over what the default should be. For many users, 
the new sort of behavior is better.

>
> 2) If you just output
>
>    binary line 42 in file x matches
>
> and continue regular output after the next newline, the breakage would 
> be much more confined.

This sounds like a good suggestion.  That is, grep could keep going if 
its only problem is an attempt to output encoding errors (as opposed to 
reading null bytes, which are a more-reliable indication of binary 
data).  It would probably be better to output just one "Binary file 
matches" line per file, at the end of the other matches, so that it's 
more likely to be noticed.

>
> 3) Fail in the old documented way of printing only the error message 
> instead of introducing a new mode of failure that looks like success 
> and loses the error message in the noise.

I don't understand this suggestion, as it's not an error or an error 
message.  But since I like (2) better perhaps it doesn't matter.

>
> 4) Don't implement this change between minor releases. A breaking 
> change deserves a major release.
>

Grep does not have minor releases. Whether to call the next release 
"2.24" or "3.0" is primarily a marketing decision, not a technical one.

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Fri, 09 Sep 2016 01:44:02 GMT) Full text and rfc822 format available.

Notification sent to Marcello Perathoner <marcello <at> perathoner.de>:
bug acknowledged by developer. (Fri, 09 Sep 2016 01:44:02 GMT) Full text and rfc822 format available.

Message #64 received at 22838-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Marcello Perathoner <marcello <at> perathoner.de>,
 Eric Blake <eblake <at> redhat.com>, 22838-done <at> debbugs.gnu.org
Cc: Hans Pelleboer <hanspelleboer <at> online.nl>,
 Bruce Dubbs <bruce.dubbs <at> gmail.com>, Jim Meyering <jim <at> meyering.net>
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
Date: Thu, 8 Sep 2016 18:43:43 -0700

[Message part 1 (text/plain, inline)]

Paul Eggert wrote:
> On 03/01/2016 02:05 AM, Marcello Perathoner wrote:
>> 2) If you just output
>>
>>    binary line 42 in file x matches
>>
>> and continue regular output after the next newline, the breakage would be much
>> more confined.
>
> This sounds like a good suggestion.  That is, grep could keep going if its only
> problem is an attempt to output encoding errors (as opposed to reading null
> bytes, which are a more-reliable indication of binary data).  It would probably
> be better to output just one "Binary file matches" line per file, at the end of
> the other matches, so that it's more likely to be noticed.

I finally got around to implementing this, which turned out to be considerably 
easier than I thought it would be. I installed the attached patch into the grep 
Savannah master. I am boldly closing this old bug report; we can always start a 
new report if further problems turn up.

[0001-grep-encoding-errors-suppress-just-their-line.patch (text/x-diff, attachment)]
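
The behaviour discussed above (suppress only the matching lines that contain 
encoding errors, keep going, and print a single "Binary file matches" notice 
at the end) can be illustrated with a small Python sketch. This is not the 
attached patch and not grep's actual code; the function name grep_lines, its 
parameters, and the UTF-8 default are hypothetical, chosen only for 
illustration. Null bytes, which the thread treats as a stronger sign of 
binary data, are deliberately not handled here.

```python
import re
from typing import Iterable, List


def grep_lines(pattern: bytes, lines: Iterable[bytes],
               name: str = "(standard input)",
               encoding: str = "utf-8") -> List[str]:
    """Illustrative only: emit matching lines, suppressing just the
    matching lines that cannot be decoded in the given encoding."""
    regex = re.compile(pattern)
    output: List[str] = []
    suppressed = False
    for raw in lines:
        line = raw.rstrip(b"\n")
        if not regex.search(line):
            continue
        try:
            output.append(line.decode(encoding))
        except UnicodeDecodeError:
            # A matching line has an encoding error: skip this line only
            # and keep processing the rest of the input.
            suppressed = True
    if suppressed:
        # One notice per input, placed after the other matches.
        output.append(f"Binary file {name} matches")
    return output


if __name__ == "__main__":
    sample = [b"a.pem\n", b"b\xff.pem\n", b"c.txt\n", b"d.pem\n"]
    for out_line in grep_lines(b"pem", sample):
        print(out_line)
```

With this approach a file list containing one badly encoded name still 
yields every other matching name, which is the confinement of breakage 
that suggestion (2) asked for.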

Information forwarded to bug-grep <at> gnu.org:
bug#22838; Package grep. (Fri, 09 Sep 2016 05:22:01 GMT) Full text and rfc822 format available.

Message #67 received at 22838-done <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 22838-done <at> debbugs.gnu.org, Eric Blake <eblake <at> redhat.com>,
 Hans Pelleboer <hanspelleboer <at> online.nl>,
 Marcello Perathoner <marcello <at> perathoner.de>,
 Bruce Dubbs <bruce.dubbs <at> gmail.com>
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
Date: Thu, 8 Sep 2016 22:20:36 -0700

On Thu, Sep 8, 2016 at 6:43 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Paul Eggert wrote:
>>
>> On 03/01/2016 02:05 AM, Marcello Perathoner wrote:
>>>
>>> 2) If you just output
>>>
>>>    binary line 42 in file x matches
>>>
>>> and continue regular output after the next newline, the breakage would be
>>> much more confined.
>>
>>
>> This sounds like a good suggestion.  That is, grep could keep going if its
>> only problem is an attempt to output encoding errors (as opposed to reading
>> null bytes, which are a more-reliable indication of binary data).  It would
>> probably be better to output just one "Binary file matches" line per file,
>> at the end of the other matches, so that it's more likely to be noticed.
>
>
> I finally got around to implementing this, which turned out to be
> considerably easier than I thought it would be. I installed the attached
> patch into the grep Savannah master. I am boldly closing this old bug
> report; we can always start a new report if further problems turn up.

Very nice.  Thank you!

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 07 Oct 2016 11:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 8 years and 256 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
