GNU bug report logs - #19242
latest grep considers text files as binary

Package: grep

Reported by: Thomas Wolff <towo <at> computer.org>

Date: Mon, 1 Dec 2014 18:02:01 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.


Message #25 received at 19242 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Thomas Wolff <towo <at> computer.org>, Paul Eggert <eggert <at> cs.ucla.edu>,
 Jim Meyering <meyering <at> fb.com>
Cc: 19242 <at> debbugs.gnu.org
Subject: Re: bug#19242: latest grep considers text files as binary
Date: Fri, 05 Dec 2014 08:39:48 -0700
On 12/05/2014 08:34 AM, Eric Blake wrote:
> On 12/05/2014 02:58 AM, Thomas Wolff wrote:
>> Paul Eggert wrote:
>>>> the mentioned patches are apparently intended to fix issues in
>>>> non-UTF-8 locales.
>>> No, they're also needed for UTF-8 locales I'm afraid.  There are some
>>> security issues, not only having to do with grep's internals, but also
>>> for the behavior of downstream programs that may be expecting UTF-8 text.
>>>
>>> You can work around the problem with 'grep -a'.
>> I was aware of this workaround but I claim it should not be needed
>> because the files affected are in fact not binary files but text files.
> 
> No, they are binary.  The POSIX definition of a text file states that
> the file may consist ONLY of characters in the current locale.  If you
> have files created under different locales, such that the bytes in the
> file are NOT characters in the current locale, then that file is binary
> under the current locale, even though it may be text in a better locale.
> 
>> The manual clearly says about -a: "Process a binary file as if it were
>> text" but partial content in a different text encoding does not make a
>> file binary.
> 
> Yes, it does, per POSIX.

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_397

A file that contains characters organized into zero or more lines. The
lines do not contain NUL characters and none can exceed {LINE_MAX} bytes
in length, including the <newline> character. Although POSIX.1-2008 does
not distinguish between text files and binary files (see the ISO C
standard), many utilities only produce predictable or meaningful output
when operating on text files. The standard utilities that have such
restrictions always specify "text files" in their STDIN or INPUT FILES
sections.
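
The definition above can be illustrated with a rough sketch. This is not how grep decides, and it is not taken from grep's source: the function name is_text is hypothetical, a Python codec name stands in for "the current locale", and LINE_MAX is assumed to be 2048 (the POSIX minimum), whereas a real system reports its own value.

```python
# Assumption: 2048 is the POSIX minimum for {LINE_MAX}; a real system may allow more.
LINE_MAX = 2048


def is_text(data: bytes, encoding: str = "utf-8", line_max: int = LINE_MAX) -> bool:
    """Approximate the POSIX "text file" definition for a single encoding.

    The encoding argument is a stand-in for the current locale's charset.
    """
    # Lines of a text file do not contain NUL characters.
    if b"\x00" in data:
        return False

    # Every byte sequence must form a valid character in the chosen encoding.
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False

    # No line may exceed line_max bytes, counting the trailing newline.
    pieces = data.split(b"\n")
    if any(len(piece) + 1 > line_max for piece in pieces[:-1]):
        return False
    # The final piece has no newline after it (it is empty if the data ends in one).
    return len(pieces[-1]) <= line_max


latin1_bytes = b"caf\xe9\n"              # "café" encoded as ISO-8859-1
print(is_text(latin1_bytes, "utf-8"))    # not text when interpreted as UTF-8
print(is_text(latin1_bytes, "latin-1"))  # text when interpreted as ISO-8859-1
```

The same bytes pass under one encoding and fail under another, which is the situation described in this thread: the file's status depends on the locale it is read under, not only on its contents.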

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

This bug report was last modified 10 years and 65 days ago.

