GNU bug report logs - #19242
latest grep considers text files as binary

Package: grep

Reported by: Thomas Wolff <towo <at> computer.org>

Date: Mon, 1 Dec 2014 18:02:01 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Report forwarded to bug-grep <at> gnu.org:
bug#19242; Package grep. (Mon, 01 Dec 2014 18:02:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Thomas Wolff <towo <at> computer.org>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Mon, 01 Dec 2014 18:02:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Thomas Wolff <towo <at> computer.org>
To: bug-grep <at> gnu.org
Cc: meyering <at> fb.com, eggert <at> cs.ucla.edu, noritnk <at> kcn.ne.jp
Subject: latest grep considers text files as binary
Date: Mon, 01 Dec 2014 18:05:51 +0100

Since grep 2.21, grep fails to report matches in a UTF-8 file with a few
non-UTF-8 bytes interspersed. This is likely to be related to one of the
recent patches related to encoding or multi-byte issues I see in the 
change log.

I have a number of large UTF-8 source files with some non-UTF-8 characters
used as constants and it was quite useful that grep nonetheless would
simply report the requested matches. Now it reports just
"Binary file ... matches" even if the file contains only a single
non-UTF-8 byte, which I consider quite inappropriate.
I would appreciate having the previous behaviour restored, at least in a
UTF-8 locale, as the mentioned patches are apparently intended to fix
issues in non-UTF-8 locales.

Kind regards,
Thomas

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Mon, 01 Dec 2014 22:43:01 GMT) Full text and rfc822 format available.

Notification sent to Thomas Wolff <towo <at> computer.org>:
bug acknowledged by developer. (Mon, 01 Dec 2014 22:43:02 GMT) Full text and rfc822 format available.

Message #10 received at 19242-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: 19242-done <at> debbugs.gnu.org
Cc: noritnk <at> kcn.ne.jp
Subject: Re: latest grep considers text files as binary
Date: Mon, 01 Dec 2014 14:41:53 -0800

Also marking Bug#19242 as done, since it's the same as Bug#19241.

Information forwarded to bug-grep <at> gnu.org:
bug#19242; Package grep. (Fri, 05 Dec 2014 10:00:02 GMT) Full text and rfc822 format available.

Message #13 received at 19242 <at> debbugs.gnu.org (full text, mbox):

From: Thomas Wolff <towo <at> computer.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>, Jim Meyering <meyering <at> fb.com>
Cc: 19242 <at> debbugs.gnu.org, noritnk <at> kcn.ne.jp
Subject: Re: latest grep considers text files as binary
Date: Fri, 05 Dec 2014 10:58:49 +0100

Paul Eggert wrote:
>> the mentioned patches are apparently intended to fix issues in 
>> non-UTF-8 locales.
> No, they're also needed for UTF-8 locales I'm afraid.  There are some 
> security issues, not only having to do with grep's internals, but also 
> for the behavior of downstream programs that may be expecting UTF-8 text.
>
> You can work around the problem with 'grep -a'.
I was aware of this workaround but I claim it should not be needed 
because the files affected are in fact not binary files but text files. 
The manual clearly says about -a: "Process a binary file as if it were 
text" but partial content in a different text encoding does not make a 
file binary.

Jim Meyering wrote:
>   this is due to documented and desirable behavior.
I deny this is desirable behavior and I doubt there is a security issue 
as described. If any other, independent software has a security issue 
with non-UTF-8 input, it should decide itself to filter it and use 
accordingly stable decoding functions. It cannot be the task of any tool 
(grep in this case) to filter output to work around possible security 
issues in other programs in a pipe. This would be completely against the 
concept of pipes in the Unix tradition.

Honestly I think this is another case of practical usefulness losing 
against dogma in software design.

Kind regards,
Thomas

Information forwarded to bug-grep <at> gnu.org:
bug#19242; Package grep. (Fri, 05 Dec 2014 15:01:02 GMT) Full text and rfc822 format available.

Message #16 received at 19242 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Thomas Wolff <towo <at> computer.org>
Cc: Jim Meyering <meyering <at> fb.com>, Paul Eggert <eggert <at> cs.ucla.edu>,
 19242 <at> debbugs.gnu.org
Subject: Re: bug#19242: latest grep considers text files as binary
Date: Fri, 5 Dec 2014 07:00:21 -0800

On Fri, Dec 5, 2014 at 1:58 AM, Thomas Wolff <towo <at> computer.org> wrote:
> Paul Eggert wrote:
>>>
>>> the mentioned patches are apparently intended to fix issues in non-UTF-8
>>> locales.
>>
>> No, they're also needed for UTF-8 locales I'm afraid.  There are some
>> security issues, not only having to do with grep's internals, but also for
>> the behavior of downstream programs that may be expecting UTF-8 text.
>>
>> You can work around the problem with 'grep -a'.
>
> I was aware of this workaround but I claim it should not be needed because
> the files affected are in fact not binary files but text files. The manual
> clearly says about -a: "Process a binary file as if it were text" but
> partial content in a different text encoding does not make a file binary.
>
> Jim Meyering wrote:
>>
>>   this is due to documented and desirable behavior.
>
> I deny this is desirable behavior and I doubt there is a security issue as
> described. If any other, independent software has a security issue with
> non-UTF-8 input, it should decide itself to filter it and use accordingly
> stable decoding functions. It cannot be the task of any tool (grep in this
> case) to filter output to work around possible security issues in other
> programs in a pipe. This would be completely against the concept of pipes in
> the Unix tradition.

This is another side effect of using a multibyte locale.
As long as there are no NUL bytes in your input, you can work
around the issue by running grep in the C locale:

  LC_ALL=C grep ...
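
A minimal sketch of this workaround, assuming a reasonably recent GNU grep and a POSIX shell. The sample file and its contents are made up for illustration; they are not taken from the report.

```shell
# Made-up sample: one valid UTF-8 line and one line containing a stray
# Latin-1 byte (octal 351) that is not valid UTF-8 on its own.
sample=$(mktemp)
printf 'name = "caf\303\251"\nname = "caf\351"\n' > "$sample"

# In the C locale each byte is treated as a single-byte character, so the
# matching lines are printed rather than a binary-file notice.
c_locale_matches=$(LC_ALL=C grep 'name' "$sample")
printf '%s\n' "$c_locale_matches"

rm -f "$sample"
```

The trade-off is that, in the C locale, patterns are matched byte by byte, so multibyte characters in the pattern or the input are no longer treated as single characters.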

Information forwarded to bug-grep <at> gnu.org:
bug#19242; Package grep. (Fri, 05 Dec 2014 15:36:01 GMT) Full text and rfc822 format available.

Message #19 received at 19242 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Thomas Wolff <towo <at> computer.org>, Paul Eggert <eggert <at> cs.ucla.edu>,
 Jim Meyering <meyering <at> fb.com>
Cc: 19242 <at> debbugs.gnu.org
Subject: Re: bug#19242: latest grep considers text files as binary
Date: Fri, 05 Dec 2014 08:34:55 -0700

On 12/05/2014 02:58 AM, Thomas Wolff wrote:
> Paul Eggert wrote:
>>> the mentioned patches are apparently intended to fix issues in
>>> non-UTF-8 locales.
>> No, they're also needed for UTF-8 locales I'm afraid.  There are some
>> security issues, not only having to do with grep's internals, but also
>> for the behavior of downstream programs that may be expecting UTF-8 text.
>>
>> You can work around the problem with 'grep -a'.
> I was aware of this workaround but I claim it should not be needed
> because the files affected are in fact not binary files but text files.

No, they are binary.  The POSIX definition of a text file states that
the file may consist ONLY of characters in the current locale.  If you
have files created under different locales, such that the bytes in the
file are NOT characters in the current locale, then that file is binary
under the current locale, even though it may be text in a better locale.

> The manual clearly says about -a: "Process a binary file as if it were
> text" but partial content in a different text encoding does not make a
> file binary.

Yes, it does, per POSIX.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Information forwarded to bug-grep <at> gnu.org:
bug#19242; Package grep. (Fri, 05 Dec 2014 15:37:02 GMT) Full text and rfc822 format available.

Message #22 received at 19242 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Jim Meyering <jim <at> meyering.net>, Thomas Wolff <towo <at> computer.org>
Cc: Jim Meyering <meyering <at> fb.com>, Paul Eggert <eggert <at> cs.ucla.edu>,
 19242 <at> debbugs.gnu.org
Subject: Re: bug#19242: latest grep considers text files as binary
Date: Fri, 05 Dec 2014 08:36:17 -0700

On 12/05/2014 08:00 AM, Jim Meyering wrote:

>>
>> I deny this is desirable behavior and I doubt there is a security issue as
>> described. If any other, independent software has a security issue with
>> non-UTF-8 input, it should decide itself to filter it and use accordingly
>> stable decoding functions. It cannot be the task of any tool (grep in this
>> case) to filter output to work around possible security issues in other
>> programs in a pipe. This would be completely against the concept of pipes in
>> the Unix tradition.
> 
> This is another side effect of using a multibyte locale.
> As long as there are no NUL bytes in your input, you can work
> around the issue by running grep in the C locale:
> 
>   LC_ALL=C grep ...

Yes, the C locale has the nice effect of EVERY byte being a valid single
byte character, leaving only NUL bytes and a non-empty file not ending
in newline as the only reasons for a file to be marked binary.
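
The NUL-byte case can be sketched as follows, assuming GNU grep. The input is invented for illustration, and the exact wording and destination of the binary-file notice differ between grep versions, so the sketch only checks that the matching line itself is not printed.

```shell
# A NUL byte (\000) sits between two words on standard input.
# Without -a, grep reports a binary match instead of printing the line.
plain=$(printf 'alpha\000beta\n' | LC_ALL=C grep 'alpha' 2>/dev/null || true)

# With -a, the line is printed; tr replaces the NUL with "." for display.
forced=$(printf 'alpha\000beta\n' | LC_ALL=C grep -a 'alpha' | tr '\000' '.')

echo "without -a: ${plain:-(no line on standard output)}"
echo "with -a:    $forced"
```

So `LC_ALL=C` removes encoding errors as a cause of the binary classification, but input containing NUL bytes still needs `-a`.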

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Information forwarded to bug-grep <at> gnu.org:
bug#19242; Package grep. (Fri, 05 Dec 2014 15:40:04 GMT) Full text and rfc822 format available.

Message #25 received at 19242 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Thomas Wolff <towo <at> computer.org>, Paul Eggert <eggert <at> cs.ucla.edu>,
 Jim Meyering <meyering <at> fb.com>
Cc: 19242 <at> debbugs.gnu.org
Subject: Re: bug#19242: latest grep considers text files as binary
Date: Fri, 05 Dec 2014 08:39:48 -0700

On 12/05/2014 08:34 AM, Eric Blake wrote:
> On 12/05/2014 02:58 AM, Thomas Wolff wrote:
>> Paul Eggert wrote:
>>>> the mentioned patches are apparently intended to fix issues in
>>>> non-UTF-8 locales.
>>> No, they're also needed for UTF-8 locales I'm afraid.  There are some
>>> security issues, not only having to do with grep's internals, but also
>>> for the behavior of downstream programs that may be expecting UTF-8 text.
>>>
>>> You can work around the problem with 'grep -a'.
>> I was aware of this workaround but I claim it should not be needed
>> because the files affected are in fact not binary files but text files.
> 
> No, they are binary.  The POSIX definition of a text file states that
> the file may consist ONLY of characters in the current locale.  If you
> have files created under different locales, such that the bytes in the
> file are NOT characters in the current locale, then that file is binary
> under the current locale, even though it may be text in a better locale.
> 
>> The manual clearly says about -a: "Process a binary file as if it were
>> text" but partial content in a different text encoding does not make a
>> file binary.
> 
> Yes, it does, per POSIX.

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_397

A file that contains characters organized into zero or more lines. The
lines do not contain NUL characters and none can exceed {LINE_MAX} bytes
in length, including the <newline> character. Although POSIX.1-2008 does
not distinguish between text files and binary files (see the ISO C
standard), many utilities only produce predictable or meaningful output
when operating on text files. The standard utilities that have such
restrictions always specify "text files" in their STDIN or INPUT FILES
sections.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 03 Jan 2015 12:24:03 GMT) Full text and rfc822 format available.

Did not alter fixed versions and reopened. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 23 Mar 2015 00:10:03 GMT) Full text and rfc822 format available.

bug unarchived. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Mon, 23 Mar 2015 00:42:01 GMT) Full text and rfc822 format available.

Information forwarded to bug-grep <at> gnu.org:
bug#19242; Package grep. (Mon, 23 Mar 2015 00:43:01 GMT) Full text and rfc822 format available.

Message #34 received at 19242 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Thomas Wolff <towo <at> towo.net>, Jim Meyering <meyering <at> fb.com>
Cc: 19242 <at> debbugs.gnu.org, noritnk <at> kcn.ne.jp
Subject: Re: latest grep considers text files as binary
Date: Sun, 22 Mar 2015 17:42:25 -0700

Thomas Wolff wrote:
> Hi Paul and Jim,
>
> Thanks for your previous quick responses on this matter and excuse my very late
> additional statement.
>
> However, the arguments are not convincing.
> The new behavior violates the principle of least astonishment which is well
> established in software design.

That cuts both ways.  Older versions of grep could dump core when given 
improperly encoded text, which is even more astonishing.  The new version is an 
improvement in that particular area.  It is not clear how grep could be modified 
to avoid the core dumps while still preserving the old behavior in question.

> It is not convincing that a text file is not considered a text file for a few
> bytes that are not properly encoded in the current locale. Also the quoted POSIX
> clause does not support that claim.

Not by itself, but from the chain of definitions it's clear that a text file 
must contain properly encoded text.  The quoted POSIX clause (3.397) says that a 
text file contains "characters", and an earlier clause (3.87) defines 
"character" to be "A sequence of one or more bytes representing a single graphic 
symbol or control code. Note: This term corresponds to the ISO C standard term 
multi-byte character".

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_87

Because encoding errors are not characters, they are not text.
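
One rough way to test whether a file consists only of properly encoded UTF-8 is to round-trip it through iconv. This is a sketch under assumptions: `is_valid_utf8` is a hypothetical helper name, the sample files are made up, and this is not how grep itself decides. It also checks encoding only, not NUL bytes or line length.

```shell
# is_valid_utf8 is a hypothetical helper, not part of grep or POSIX.
# It succeeds only if every byte sequence in the file decodes as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

good=$(mktemp)
bad=$(mktemp)
printf 'caf\303\251\n' > "$good"   # "é" encoded as two UTF-8 bytes
printf 'caf\351\n' > "$bad"        # lone Latin-1 byte: an encoding error

if is_valid_utf8 "$good"; then good_result=text; else good_result=binary; fi
if is_valid_utf8 "$bad"; then bad_result=text; else bad_result=binary; fi
echo "good: $good_result, bad: $bad_result"

rm -f "$good" "$bad"
```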

> And, considering the "pipe security" argument, shall all classic Unix tools now
> get additional options -a, so that something like
>      grep 'bla' | sed -e 'expr' | tr '' '' | grep -v 'argl'
> would in future look like
>      grep -a 'bla' | sed -a -e 'expr' | tr -a '' '' | grep -a -v 'argl'
> ?

It shouldn't be needed for tr, as tr's input is not required to be a text file.

GNU sed doesn't worry about whether files are text or binary.  I expect this is 
because the problem of spitting out random binary data tends to be less of an 
issue for 'sed' in practice.  However, portable scripts should not assume that 
'sed' will work on arbitrary binary data.

> What about backwards compatibility of scripts then?
> This is breaking decades of Unix tradition of modular tools for the mere
> dogmatics of some peculiar and strict locale theory.

UTF-8 does tend to have that effect, yes.  From the traditional Unix point of 
view, patterns like 'a.b' are "broken" with modern grep in UTF-8 locales, since 
the "." no longer matches only single bytes.  This has been true for decades, 
not just for 'grep' but also for 'sed' etc.  These days, though, users tend to 
be more interested in dealing with multibyte characters than in insisting on 
circa-1977 semantics in all cases.
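
The byte-versus-character difference for "." can be sketched like this, assuming GNU grep. Only the C-locale side is run here, because the name of an installed UTF-8 locale varies between systems.

```shell
# "é" is two bytes in UTF-8 (octal 303 251). In the C locale "." matches
# exactly one byte, so 'a.b' does not span both bytes but 'a..b' does.
# grep -c prints the number of matching lines; "|| true" absorbs grep's
# exit status 1 when nothing matches.
one_dot=$(printf 'a\303\251b\n' | LC_ALL=C grep -c 'a.b' || true)
two_dots=$(printf 'a\303\251b\n' | LC_ALL=C grep -c 'a..b' || true)
echo "a.b: $one_dot, a..b: $two_dots"
```

In a UTF-8 locale the same 'a.b' pattern would be expected to match this line, since "." then matches the whole two-byte character.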

> If you insist on this priority of locale strategy over Unix tradition,
> please offer at least a compatibility option that does not break scripts,
> i.e. an environment setting that enforces compatible behaviour (like other tools
> have, e.g. LS_COLORS etc).

Instead of an environment variable I suggest using a script.  Please see:

http://bugs.gnu.org/19998#8
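
A wrapper along these lines might look as follows. This is only a sketch, not the script from the linked message: `textgrep` is a hypothetical name, and it simply forwards its arguments to grep with -a added.

```shell
# textgrep is a hypothetical wrapper name. It runs the real grep with -a,
# so input with encoding errors is treated as text.
textgrep() {
    command grep -a "$@"
}

# Made-up input with a stray Latin-1 byte (octal 351) before the match.
wrapped=$(printf 'caf\351 bla\n' | textgrep 'bla')
printf '%s\n' "$wrapped"
```

Because -a makes grep print raw bytes, the output may still contain data that is not valid in the current locale, which is the downstream concern raised earlier in the thread.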

> As a last remark, I wonder why my report does not show up in
> http://debbugs.gnu.org/cgi/pkgreport.cgi?package=grep
> and apparently I cannot submit anything there myself. Please get the issue
> documented there.

I unarchived that bug report and am quoting the entire new part of your message, 
which should do the trick.

> Kind regards,
> Thomas

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 20 Apr 2015 11:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 10 years and 114 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.