GNU bug report logs - #23234
unexpected results with charset handling in GNU grep 2.23
Reported by: Björn JACKE <bjoern <at> j3e.de>
Date: Wed, 6 Apr 2016 20:45:01 UTC
Severity: normal
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
Message #8 received at 23234 <at> debbugs.gnu.org:
On 04/06/2016 01:25 PM, Björn JACKE wrote:
> Let's take this example using grep 2.23:
>
> # echo -e "test\ntäst\ntest" | iconv -f utf8 -t latin1 | LC_ALL=C grep "st" ; echo $?
[As a side point, 'echo -e' is non-portable; it is better to use printf.]
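To illustrate the printf suggestion, here is a minimal sketch that produces the same Latin-1 test input without 'echo -e' and without iconv. The helper name make_input is hypothetical; \344 is the octal escape for byte 0xe4, which is 'ä' in Latin-1.

```shell
# Hypothetical helper: emit the three test lines, with the middle line
# containing the single byte 0xe4 (Latin-1 'ä') via a portable octal escape.
make_input() {
    printf 'test\nt\344st\ntest\n'
}

# Dump the bytes in hex to confirm that 0xe4 is really present.
make_input | od -An -tx1
```

Piping make_input into grep should exercise the same case as the iconv pipeline, assuming iconv maps 'ä' to 0xe4 as Latin-1 does.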
Hmm. POSIX says that a file is binary if it does not end in a newline, if
it contains an embedded NUL, or if it contains an encoding error. But it
also says that LC_ALL=C is _required_ to treat all 256 byte values as
valid characters (ASCII is only required to treat 7-bit characters as
valid, and may reject 8-bit bytes, but LC_ALL=C is _not_ ASCII). This
indeed looks like a bug in current grep.git, as I can reproduce it:
$ git rev-parse HEAD
2ba6ab34da05d3aebc5e7e3dfaedb1cf3ddc5a73
$ printf "test\ntäst\ntest\n" | iconv -f utf8 -t latin1 |
LC_ALL=C src/grep "st"
test
Binary file (standard input) matches
Looks like we don't have something quite right in claiming that 0xe4 is
not a valid character when in the single-byte C locale.
> I really hope this change will be reverted as soon as possible. I would rather
> prefer GNU grep to become posix compliant and not do any binary detection by
> default actually.
The change of treating encoding errors as binary files will NOT be
reverted, but here, you HAVE pointed out a bug where we are treating
something as binary that is NOT an encoding error (because by
definition, LC_ALL=C has no encoding errors - all 256 byte values are
characters). So this is indeed a bug to be fixed.
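As a rough sketch of how the expected behavior could be checked: if every byte is a valid character in the C locale, all three input lines should count as matches. The helper name count_st_matches is hypothetical, and the expectation of 3 assumes a grep that treats 0xe4 as a valid character under LC_ALL=C.

```shell
# Hypothetical check: count the lines matching 'st' in the C locale.
# The middle line contains the byte 0xe4 (octal \344, Latin-1 'ä').
count_st_matches() {
    printf 'test\nt\344st\ntest\n' | LC_ALL=C grep -c 'st'
}

# With 0xe4 accepted as a character, this is expected to report 3.
count_st_matches
```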
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org