GNU bug report logs - #19242
latest grep considers text files as binary


Package: grep

Reported by: Thomas Wolff <towo <at> computer.org>

Date: Mon, 1 Dec 2014 18:02:01 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.



Message #34 received at 19242 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Thomas Wolff <towo <at> towo.net>, Jim Meyering <meyering <at> fb.com>
Cc: 19242 <at> debbugs.gnu.org, noritnk <at> kcn.ne.jp
Subject: Re: latest grep considers text files as binary
Date: Sun, 22 Mar 2015 17:42:25 -0700

Thomas Wolff wrote:
> Hi Paul and Jim,
>
> Thanks for your previous quick responses on this matter and excuse my very late
> additional statement.
>
> However, the arguments are not convincing.
> The new behavior violates the principle of least astonishment which is well
> established in software design.

That cuts both ways.  Older versions of grep could dump core when given 
improperly encoded text, which is even more astonishing.  The new version is an 
improvement in that particular area.  It is not clear how grep could be modified 
to avoid the core dumps while still preserving the old behavior in question.

> It is not convincing that a text file is not considered a text file for a few
> bytes that are not properly encoded in the current locale. Also the quoted POSIX
> clause does not support that claim.

Not by itself, but from the chain of definitions it's clear that a text file 
must contain properly encoded text.  The quoted POSIX clause (3.397) says that a 
text file contains "characters", and an earlier clause (3.87) defines 
"character" to be "A sequence of one or more bytes representing a single graphic 
symbol or control code. Note: This term corresponds to the ISO C standard term 
multi-byte character".

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_87

Because encoding errors are not characters, they are not text.
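
As a rough illustration of that chain of definitions, the sketch below treats a byte string as "text" only if every byte sequence in it decodes to a character in the given encoding. It is an assumption-laden stand-in, not grep's implementation: the helper name is_text_in_encoding is hypothetical, Python's strict decoder plays the role of the locale's multibyte decoder, and other text-file requirements (such as the absence of NUL bytes) are not modeled.

```python
def is_text_in_encoding(data: bytes, encoding: str = "utf-8") -> bool:
    """Return True if data decodes without encoding errors.

    Hypothetical helper for illustration only; it checks just the
    "every byte sequence is a character" part of the definition.
    """
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# The same word encoded two ways: once as UTF-8, once as Latin-1.
valid_utf8 = "caf\u00e9\n".encode("utf-8")      # b'caf\xc3\xa9\n'
latin1_bytes = "caf\u00e9\n".encode("latin-1")  # b'caf\xe9\n'

utf8_is_text = is_text_in_encoding(valid_utf8)
latin1_is_text = is_text_in_encoding(latin1_bytes)
print(utf8_is_text, latin1_is_text)
```

Under these assumptions the Latin-1 bytes fail the check in a UTF-8 setting, because the lone 0xE9 byte is an encoding error rather than a character. The same bytes pass when the encoding argument is "latin-1", which mirrors how the outcome depends on the current locale.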

> And, considering the "pipe security" argument, shall all classic Unix tools now
> get additional options -a, so that something like
>      grep 'bla' | sed -e 'expr' | tr '' '' | grep -v 'argl'
> would in future look like
>      grep -a 'bla' | sed -a -e 'expr' | tr -a '' '' | grep -a -v 'argl'
> ?

It shouldn't be needed for tr, as tr's input is not required to be a text file.

GNU sed doesn't worry about whether files are text or binary.  I expect this is 
because the problem of spitting out random binary data tends to be less of an 
issue for 'sed' in practice.  However, portable scripts should not assume that 
'sed' will work on arbitrary binary data.

> What about backwards compatibility of scripts then?
> This is breaking decades of Unix tradition of modular tools for the mere
> dogmatics of some peculiar and strict locale theory.

UTF-8 does tend to have that effect, yes.  From the traditional Unix point of 
view, patterns like 'a.b' are "broken" with modern grep in UTF-8 locales, since 
the "." no longer matches only single bytes.  This has been true for decades, 
not just for 'grep' but also for 'sed' etc.  These days, though, users tend to 
be more interested in dealing with multibyte characters than in insisting on 
circa-1977 semantics in all cases.
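
The difference between the two readings of 'a.b' can be sketched with Python's re module, which is used here only as a stand-in for grep: a str pattern gives "." character semantics, and a bytes pattern gives it single-byte semantics. The sample string is an assumption chosen for illustration.

```python
import re

text = "a\u00e9b"           # three characters: 'a', e-acute, 'b'
raw = text.encode("utf-8")  # four bytes: the e-acute takes two bytes

# Character semantics: "." matches the one multibyte character.
char_match = re.fullmatch(r"a.b", text) is not None

# Byte semantics: "." matches exactly one byte, so the two-byte
# character between 'a' and 'b' cannot be covered by a single ".".
byte_match = re.fullmatch(rb"a.b", raw) is not None

print(len(raw), char_match, byte_match)
```

With byte semantics the pattern would need to be 'a..b' to match this input, which is the sense in which a traditional single-byte reading of "." no longer holds in a UTF-8 locale.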

> If you insist on this priority of locale strategy over Unix tradition,
> please offer at least a compatibility option that does not break scripts,
> i.e. an environment setting that enforces compatible behaviour (like other tools
> have, e.g. LS_COLORS etc).

Instead of an environment variable I suggest using a script.  Please see:

http://bugs.gnu.org/19998#8


> As a last remark, I wonder why my report does not show up in
> http://debbugs.gnu.org/cgi/pkgreport.cgi?package=grep
> and apparently I cannot submit anything there myself. Please get the issue
> documented there.

I unarchived that bug report and am quoting the entire new part of your message, 
which should do the trick.

> Kind regards,
> Thomas







