GNU bug report logs - #38503
Locale can cause incorrect number parsing in binary files

Package: grep

Reported by: jan h <jharald.j <at> gmail.com>

Date: Thu, 5 Dec 2019 20:02:01 UTC

Severity: normal

Tags: notabug

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.

Report forwarded to bug-grep <at> gnu.org:
bug#38503; Package grep. (Thu, 05 Dec 2019 20:02:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to jan h <jharald.j <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Thu, 05 Dec 2019 20:02:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: jan h <jharald.j <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: Locale can cause incorrect number parsing in binary files
Date: Thu, 5 Dec 2019 18:30:58 +0000

grep 3.3

I get a few weird symbols (seems valid utf-8), along with normal
numbers with the following simple snippet (.UTF-8 and .utf8 result in
same, even .UtF---8 is the same):
LC_ALL=en_US.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"
wc -c counts 1047, 1033, 1036, etc., so they're multi-byte characters
meanwhile, with LC_ALL being C.UTF-8 this is not the case,
LC_ALL=C.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"|wc -c
consistently results in 1024 characters/bytes, as it's supposed to be...
it's not just en_US, it seems ANY utf-8 locale, other than C results
in this bug, whereas non-utf8 versions are fine, bare en_US doesn't
show this bug, nor does en_US.iso88591...

worthy of note is that [[:digit:]] works correctly, while [0-9] does
not (and 1-9 is same bug as 0-9, if you were wondering), setting -E
doesn't change anything either...

Added tag(s) notabug. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Thu, 05 Dec 2019 20:30:03 GMT) Full text and rfc822 format available.

Reply sent to Eric Blake <eblake <at> redhat.com>:
You have taken responsibility. (Thu, 05 Dec 2019 20:30:04 GMT) Full text and rfc822 format available.

Notification sent to jan h <jharald.j <at> gmail.com>:
bug acknowledged by developer. (Thu, 05 Dec 2019 20:30:05 GMT) Full text and rfc822 format available.

Message #12 received at 38503-done <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: jan h <jharald.j <at> gmail.com>, 38503-done <at> debbugs.gnu.org
Subject: Re: bug#38503: Locale can cause incorrect number parsing in binary
 files
Date: Thu, 5 Dec 2019 14:29:19 -0600

tag 38503 notabug
thanks

On 12/5/19 12:30 PM, jan h wrote:
> grep 3.3
> 
> I get a few weird symbols (seems valid utf-8), along with normal
> numbers with the following simple snippet (.UTF-8 and .utf8 result in
> same, even .UtF---8 is the same):
> LC_ALL=en_US.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"
> wc -c counts 1047, 1033, 1036, etc., so they're multi-byte characters

It's important to note that POSIX says that the regex [0-9] has 
locale-dependent effects.  Outside of the C/POSIX locale, it matches 
whatever the locale definition says it should.  For example, some 
locales allow [A-Z] to match non-ASCII letters like Á.  Similarly, as 
you have found, on your system, the en_US.UTF-8 locale is defined to 
match non-ASCII Unicode digits when a range expression for [0-9] is in 
force.

Note that the Rational Range Interpretation of ranges claims that [0-9] 
should have the expansion [0123456789] in ALL locales; and more and more 
versions of GNU utilities are starting to move to RRI (even newer glibc 
is trying to move towards RRI for more regex operations).  If this 
example is run where RRI is in force, then it should not match non-ASCII 
Unicode digits.  But you didn't mention which version of grep you are 
using, let alone which version of libc is providing your locale 
definitions, to make that determination; and POSIX does not require RRI.
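
As a rough illustration of what that expansion means in practice, the sketch below spells the digits out explicitly instead of using a range. It assumes a grep that supports the non-POSIX -o and -a options (GNU grep does); the sample() helper and its input are made up for the example. An explicit list contains no range expression, so it should match only the ten listed characters whatever the locale.

```shell
# Hypothetical sample input: ASCII digits mixed with ARABIC-INDIC DIGIT
# THREE (U+0663), written as its UTF-8 bytes D9 A3 via octal escapes.
sample() { printf 'a1\331\243b2c3\n'; }

# An explicit list has no range in it, so locale collation order is not
# consulted; only the listed ASCII digits can match.
explicit_count=$(sample | grep -a -o '[0123456789]' | tr -d '\n' | wc -c | tr -d ' ')
echo "bytes matched by [0123456789]: $explicit_count"
```

With this input the count should be 3 (the bytes for 1, 2 and 3), independent of LC_ALL.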

> meanwhile, with LC_ALL being C.UTF-8 this is not the case,
> LC_ALL=C.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"|wc -c
> consistently results in 1024 characters/bytes, as it's supposed to be...

Well, in the POSIX locale (C.UTF-8 is not the POSIX locale, but follows 
enough of the same rules), [0-9] _is_ required to match the same as 
[0123456789].  That's the only locale where you get RRI for free, 
rather than having to worry if your choice of program version and locale 
definition provide it.
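
A minimal sketch of that difference, using a fixed input rather than /dev/urandom so the result is repeatable. The sample() helper is hypothetical, and the en_US.UTF-8 line assumes that locale is installed; if it is not, grep typically falls back to C locale behaviour. What the second count turns out to be depends on the libc and grep build, so only the C locale result is predictable here.

```shell
# Hypothetical sample input: three ASCII digits plus ARABIC-INDIC DIGIT
# THREE (U+0663) encoded as the UTF-8 bytes D9 A3.
sample() { printf 'a1\331\243b2c3\n'; }

# C locale: [0-9] is required to behave like [0123456789].
c_count=$(sample | LC_ALL=C grep -a -o '[0-9]' | tr -d '\n' | wc -c | tr -d ' ')

# UTF-8 locale: the result depends on the locale definition and on whether
# the regex implementation in use applies RRI, so it is only printed.
utf8_count=$(sample | LC_ALL=en_US.UTF-8 grep -a -o '[0-9]' | tr -d '\n' | wc -c | tr -d ' ')

echo "C locale:    $c_count bytes matched"
echo "en_US.UTF-8: $utf8_count bytes matched"
```

A count above 3 on the second line would indicate that the range is picking up the two-byte non-ASCII digit, which is the symptom described in the report.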

> it's not just en_US, it seems ANY utf-8 locale, other than C results
> in this bug, whereas non-utf8 versions are fine, bare en_US doesn't
> show this bug, nor does en_US.iso88591...

en_US.iso88591 does not have the problem because in that encoding, there 
aren't any non-ASCII digits.  So [0-9] will never match any non-ASCII 
Unicode digits because the charset in use doesn't have such characters.

> 
> worthy of note is that [[:digit:]] works correctly, while [0-9] does
> not (and 1-9 is same bug as 0-9, if you were wondering), setting -E
> doesn't change anything either...

POSIX requires [[:digit:]] to expand to the same 10 characters in ALL 
locales, regardless of what the implementation does with [0-9], and 
regardless of whether an implementation uses RRI.  (This is true for 
[[:digit:]], but not for other named ranges; for example, [[:alpha:]] is 
still locale-dependent and may expand to more than 26 characters).
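
The contrast between the two classes can be sketched as follows, again assuming a grep with the non-POSIX -o and -a options; the helper names and sample bytes are invented for the example. The [[:digit:]] count is taken in whatever locale is currently active, while the [[:alpha:]] count is pinned to the C locale, because in other locales it may legitimately differ.

```shell
# Hypothetical inputs: digits_in contains U+0663 (UTF-8 bytes D9 A3),
# alpha_in contains U+00E9 "e with acute" (UTF-8 bytes C3 A9).
digits_in() { printf 'a1\331\243b2c3\n'; }
alpha_in()  { printf 'x\303\251y1\n'; }

# [[:digit:]] should match only the ASCII digits in any locale.
digit_count=$(digits_in | grep -a -o '[[:digit:]]' | tr -d '\n' | wc -c | tr -d ' ')

# [[:alpha:]] is locale-dependent; in the C locale only x and y qualify.
alpha_c_count=$(alpha_in | LC_ALL=C grep -a -o '[[:alpha:]]' | tr -d '\n' | wc -c | tr -d ' ')

echo "[[:digit:]] matched $digit_count bytes"
echo "[[:alpha:]] matched $alpha_c_count bytes in the C locale"
```

Running the [[:alpha:]] line under a UTF-8 locale instead would be expected to match the accented letter as well, giving a larger byte count.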

Since the problem you reported is due to your locale, I'm closing this 
as a non-bug. We may reopen it if additional details show that your 
version of grep was supposed to be using RRI but failed to do so.  And 
feel free to continue conversation, even if we don't reopen the bug.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Information forwarded to bug-grep <at> gnu.org:
bug#38503; Package grep. (Thu, 05 Dec 2019 20:41:01 GMT) Full text and rfc822 format available.

Message #15 received at 38503-done <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: jan h <jharald.j <at> gmail.com>, 38503-done <at> debbugs.gnu.org
Subject: Re: bug#38503: Locale can cause incorrect number parsing in binary
 files
Date: Thu, 5 Dec 2019 14:40:42 -0600

On 12/5/19 2:29 PM, Eric Blake wrote:
> tag 38503 notabug
> thanks
> 
> On 12/5/19 12:30 PM, jan h wrote:
>> grep 3.3
>>

> 
> Note that the Rational Range Interpretation of ranges claims that [0-9] 
> should have the expansion [0123456789] in ALL locales; and more and more 
> versions of GNU utilities are starting to move to RRI (even newer glibc 
> is trying to move towards RRI for more regex operations).  If this 
> example is run where RRI is in force, then it should not match non-ASCII 
> Unicode digits.  But you didn't mention which version of grep you are 
> using, let alone which version of libc is providing your locale 
> definitions, to make that determination; and POSIX does not require RRI.

Sorry, I missed that you did mention grep 3.3.  And the NEWS for grep 
does not mention 'RRI' or 'Rational Range Interpretation' (compare that 
to bash 4.2 introducing globasciiranges, or gawk introducing RRI in 
4.0.1).  So I'm not sure of the current state of whether grep tries to 
use RRI on all systems or only on systems where it relies on gnulib's 
regcomp instead of libc.  So we may still need to reopen this if we 
decide grep needs more RRI fixes.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Information forwarded to bug-grep <at> gnu.org:
bug#38503; Package grep. (Thu, 05 Dec 2019 20:44:03 GMT) Full text and rfc822 format available.

Message #18 received at submit <at> debbugs.gnu.org (full text, mbox):

From: jan h <jharald.j <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: Re: Locale can cause incorrect number parsing in binary files
Date: Thu, 5 Dec 2019 18:40:21 +0000

On another machine with grep 3.1 this does not appear to be the case,
so, regression?

On Thu, 5 December 2019 at 18:30, jan h (<jharald.j <at> gmail.com>)
wrote:
>
> grep 3.3
>
> I get a few weird symbols (seems valid utf-8), along with normal
> numbers with the following simple snippet (.UTF-8 and .utf8 result in
> same, even .UtF---8 is the same):
> LC_ALL=en_US.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"
> wc -c counts 1047, 1033, 1036, etc., so they're multi-byte characters
> meanwhile, with LC_ALL being C.UTF-8 this is not the case,
> LC_ALL=C.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"|wc -c
> consistently results in 1024 characters/bytes, as it's supposed to be...
> it's not just en_US, it seems ANY utf-8 locale, other than C results
> in this bug, whereas non-utf8 versions are fine, bare en_US doesn't
> show this bug, nor does en_US.iso88591...
>
> worthy of note is that [[:digit:]] works correctly, while [0-9] does
> not (and 1-9 is same bug as 0-9, if you were wondering), setting -E
> doesn't change anything either...

Information forwarded to bug-grep <at> gnu.org:
bug#38503; Package grep. (Thu, 05 Dec 2019 20:44:03 GMT) Full text and rfc822 format available.

Message #21 received at submit <at> debbugs.gnu.org (full text, mbox):

From: jan h <jharald.j <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: Re: Locale can cause incorrect number parsing in binary files
Date: Thu, 5 Dec 2019 18:55:01 +0000

compiling from scratch resulted in a normal, working version
apparently Arch's package was somehow badly made?

Information forwarded to bug-grep <at> gnu.org:
bug#38503; Package grep. (Thu, 05 Dec 2019 20:51:01 GMT) Full text and rfc822 format available.

Message #24 received at 38503 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: jan h <jharald.j <at> gmail.com>, 38503 <at> debbugs.gnu.org
Subject: Re: bug#38503: Locale can cause incorrect number parsing in binary
 files
Date: Thu, 5 Dec 2019 14:50:12 -0600

On 12/5/19 12:55 PM, jan h wrote:
> compiling from scratch resulted in a normal, working version
> apparently Arch's package was somehow badly made?

You also need to check whether your builds were using gnulib's regcomp 
replacement, or sticking with the one from glibc; and in turn which 
version of glibc is in use (as it was glibc 2.28 that tried to use RRI 
in more locales, although work is still not complete there - and the 
presence or absence of particular historical glibc regcomp bugs 
determines whether configure decides to use gnulib's version instead).
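
One possible way to collect the version details for a report like this is sketched below. It assumes a glibc system, where getconf accepts GNU_LIBC_VERSION; elsewhere the fallback string is printed instead. It does not reveal whether a given grep binary was built with gnulib's regcomp, which would need the configure output of that particular build.

```shell
# First line of grep's version banner (empty if --version is unsupported).
grep_version=$(grep --version 2>/dev/null | head -n 1)
grep_version=${grep_version:-unknown}

# glibc reports its version through getconf; other libcs will not.
libc_version=$(getconf GNU_LIBC_VERSION 2>/dev/null || echo "unknown (not glibc?)")

echo "grep:   $grep_version"
echo "libc:   $libc_version"
echo "locale: LC_ALL=${LC_ALL:-} LANG=${LANG:-}"
```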

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Information forwarded to bug-grep <at> gnu.org:
bug#38503; Package grep. (Thu, 05 Dec 2019 20:57:02 GMT) Full text and rfc822 format available.

Message #27 received at 38503-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Eric Blake <eblake <at> redhat.com>, jan h <jharald.j <at> gmail.com>,
 38503-done <at> debbugs.gnu.org
Subject: Re: bug#38503: Locale can cause incorrect number parsing in binary
 files
Date: Thu, 5 Dec 2019 12:56:03 -0800

On 12/5/19 12:40 PM, Eric Blake wrote:
> I'm not sure of the current state of whether grep tries to use RRI on 
> all systems or only on systems where it relies on gnulib's regcomp 
> instead of libc.

As I recall, grep doesn't make any special effort to use RRI. That is, 
if the underlying library uses RRI, then grep does so as well; otherwise 
it doesn't.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 03 Jan 2020 12:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 5 years and 230 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
