GNU bug report logs - #38503
Locale can cause incorrect number parsing in binary files

Package: grep;

Reported by: jan h <jharald.j <at> gmail.com>

Date: Thu, 5 Dec 2019 20:02:01 UTC

Severity: normal

Tags: notabug

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.

From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: jan h <jharald.j <at> gmail.com>
Subject: bug#38503: closed (Re: bug#38503: Locale can cause incorrect
 number parsing in binary files)
Date: Thu, 05 Dec 2019 20:30:05 +0000
[Message part 1 (text/plain, inline)]
Your bug report

#38503: Locale can cause incorrect number parsing in binary files

which was filed against the grep package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 38503 <at> debbugs.gnu.org.

-- 
38503: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=38503
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
From: Eric Blake <eblake <at> redhat.com>
To: jan h <jharald.j <at> gmail.com>, 38503-done <at> debbugs.gnu.org
Subject: Re: bug#38503: Locale can cause incorrect number parsing in binary
 files
Date: Thu, 5 Dec 2019 14:29:19 -0600
tag 38503 notabug
thanks

On 12/5/19 12:30 PM, jan h wrote:
> grep 3.3
> 
> I get a few weird symbols (seems valid utf-8), along with normal
> numbers with the following simple snippet (.UTF-8 and .utf8 result in
> same, even .UtF---8 is the same):
> LC_ALL=en_US.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"
> wc -c counts 1047 and 1033 and 1036 etc, so they're multi-byte characters

It's important to note that POSIX says that the regex [0-9] has 
locale-dependent effects.  Outside of the C/POSIX locale, it matches 
whatever the locale definition says it should.  For example, some 
locales allow [A-Z] to match non-ASCII letters like Á.  Similarly, as 
you have found, on your system, the en_US.UTF-8 locale is defined to 
match non-ASCII Unicode digits when a range expression for [0-9] is in 
force.
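
One way to see what [0-9] does on a given system, sketched under some
assumptions: a POSIX shell, a grep that supports -o and -a (as GNU grep
does), and a made-up helper name, probe_range, that is not part of any
tool. The input is a fixed sample rather than /dev/urandom. Whether these
particular digits fall inside the range is itself up to the locale
definition, so an "ASCII digits only" result does not prove that no other
character can match.

```shell
# Hypothetical probe: report whether [0-9] matches anything beyond the
# ASCII digits in the locale named by $1. The sample holds ASCII 7, then
# U+0663 (ARABIC-INDIC DIGIT THREE) and U+0969 (DEVANAGARI DIGIT THREE),
# both written as UTF-8 octal escapes.
probe_range() {
    matches=$(printf '7\331\243\340\245\251\n' |
        LC_ALL=$1 grep -a -o '[0-9]' | wc -l | tr -d ' ')
    if [ "$matches" -eq 1 ]; then
        echo "$1: [0-9] matched ASCII digits only"
    else
        echo "$1: [0-9] matched $matches characters, including non-ASCII ones"
    fi
}

probe_range C
probe_range en_US.UTF-8   # result depends on the grep and libc in use
```

If the named locale is not installed, grep typically falls back to the C
locale, so the second call may simply repeat the first result.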

Note that the Rational Range Interpretation of ranges claims that [0-9] 
should have the expansion [0123456789] in ALL locales; and more and more 
versions of GNU utilities are starting to move to RRI (even newer glibc 
is trying to move towards RRI for more regex operations).  If this 
example is run where RRI is in force, then it should not match non-ASCII 
Unicode digits.  You mention grep 3.3, but not which version of libc is 
providing your locale definitions, so I cannot make that determination; 
and POSIX does not require RRI.

> meanwhile, with LC_ALL being C.UTF-8 this is not the case,
> LC_ALL=C.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"|wc -c
> consistently results in 1024 characters/bytes, as it's supposed to be...

Well, in the POSIX locale (C.UTF-8 is not the POSIX locale, but follows 
enough of the same rules), [0-9] _is_ required to match the same as 
[0123456789].  That's the only locale where you get RRI for free, 
rather than having to worry if your choice of program version and locale 
definition provide it.

> it's not just en_US, it seems ANY utf-8 locale, other than C results
> in this bug, whereas non-utf8 versions are fine, bare en_US doesn't
> show this bug, nor does en_US.iso88591...

en_US.iso88591 does not have the problem because in that encoding, there 
aren't any non-ASCII digits.  So [0-9] will never match any non-ASCII 
Unicode digits because the charset in use doesn't have such characters.

> 
> worthy of note is that [[:digit:]] works correctly, while [0-9] does
> not (and 1-9 is same bug as 0-9, if you were wondering), setting -E
> doesn't change anything either...

POSIX requires [[:digit:]] to expand to the same 10 characters in ALL 
locales, regardless of what the implementation does with [0-9], and 
regardless of whether an implementation uses RRI.  (This is true for 
[[:digit:]], but not for other named ranges; for example, [[:alpha:]] is 
still locale-dependent and may expand to more than 26 characters).
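
As an illustration of relying on that requirement, here is a small sketch
that validates a decimal string with [[:digit:]] and no LC_ALL override.
The helper name is_decimal is hypothetical, and the sketch assumes a POSIX
shell and a grep with the standard -q and -x options.

```shell
# Hypothetical helper: succeed only if $1 is a non-empty string made up
# entirely of the ten ASCII digits. No locale override is applied, on the
# assumption (per POSIX) that [[:digit:]] means 0-9 in every locale.
is_decimal() {
    printf '%s\n' "$1" | grep -q -x '[[:digit:]][[:digit:]]*'
}

is_decimal 2019 && echo "2019: decimal"
is_decimal 20x9 || echo "20x9: not decimal"
```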

Since the problem you reported is due to your locale, I'm closing this 
as a non-bug. We may reopen it if additional details show that your 
version of grep was supposed to be using RRI but failed to do so.  And 
feel free to continue conversation, even if we don't reopen the bug.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org


[Message part 3 (message/rfc822, inline)]
From: jan h <jharald.j <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: Locale can cause incorrect number parsing in binary files
Date: Thu, 5 Dec 2019 18:30:58 +0000
grep 3.3

I get a few weird symbols (seems valid utf-8), along with normal
numbers with the following simple snippet (.UTF-8 and .utf8 result in
same, even .UtF---8 is the same):
LC_ALL=en_US.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"
wc -c counts 1047 and 1033 and 1036 etc, so they're multi-byte characters
meanwhile, with LC_ALL being C.UTF-8 this is not the case,
LC_ALL=C.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"|wc -c
consistently results in 1024 characters/bytes, as it's supposed to be...
it's not just en_US, it seems ANY utf-8 locale, other than C results
in this bug, whereas non-utf8 versions are fine, bare en_US doesn't
show this bug, nor does en_US.iso88591...

worthy of note is that [[:digit:]] works correctly, while [0-9] does
not (and 1-9 is same bug as 0-9, if you were wondering), setting -E
doesn't change anything either...



This bug report was last modified 5 years and 230 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.