GNU bug report logs - #38503
Locale can cause incorrect number parsing in binary files
Reported by: jan h <jharald.j <at> gmail.com>
Date: Thu, 5 Dec 2019 20:02:01 UTC
Severity: normal
Tags: notabug
Done: Eric Blake <eblake <at> redhat.com>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 38503 in the body.
You can then email your comments to 38503 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-grep <at> gnu.org: bug#38503; Package grep. (Thu, 05 Dec 2019 20:02:02 GMT)
Full text and rfc822 format available.
Acknowledgement sent to jan h <jharald.j <at> gmail.com>: New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Thu, 05 Dec 2019 20:02:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
grep 3.3
I get a few weird symbols (seems valid utf-8), along with normal
numbers with the following simple snippet (.UTF-8 and .utf8 result in
same, even .UtF---8 is the same):
LC_ALL=en_US.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"
wc -c counts 1047, 1033, 1036, etc., so they're multi-byte characters
meanwhile, with LC_ALL being C.UTF-8 this is not the case,
LC_ALL=C.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"|wc -c
consistently results in 1024 characters/bytes, as it's supposed to be...
it's not just en_US, it seems ANY utf-8 locale, other than C results
in this bug, whereas non-utf8 versions are fine, bare en_US doesn't
show this bug, nor does en_US.iso88591...
worthy of note is that [[:digit:]] works correctly, while [0-9] does
not (and 1-9 is same bug as 0-9, if you were wondering), setting -E
doesn't change anything either...
Added tag(s) notabug. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Thu, 05 Dec 2019 20:30:03 GMT)
Reply sent to Eric Blake <eblake <at> redhat.com>: You have taken responsibility. (Thu, 05 Dec 2019 20:30:04 GMT)
Notification sent to jan h <jharald.j <at> gmail.com>: bug acknowledged by developer. (Thu, 05 Dec 2019 20:30:05 GMT)
Message #12 received at 38503-done <at> debbugs.gnu.org (full text, mbox):
tag 38503 notabug
thanks
On 12/5/19 12:30 PM, jan h wrote:
> grep 3.3
>
> I get a few weird symbols (seems valid utf-8), along with normal
> numbers with the following simple snippet (.UTF-8 and .utf8 result in
> same, even .UtF---8 is the same):
> LC_ALL=en_US.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"
> wc -c counts 1047, 1033, 1036, etc., so they're multi-byte characters
It's important to note that POSIX says that the regex [0-9] has
locale-dependent effects. Outside of the C/POSIX locale, it matches
whatever the locale definition says it should. For example, some
locales allow [A-Z] to match non-ASCII letters like Á. Similarly, as
you have found, on your system, the en_US.UTF-8 locale is defined to
match non-ASCII Unicode digits when a range expression for [0-9] is in
force.
Note that the Rational Range Interpretation of ranges claims that [0-9]
should have the expansion [0123456789] in ALL locales; and more and more
versions of GNU utilities are starting to move to RRI (even newer glibc
is trying to move towards RRI for more regex operations). If this
example is run where RRI is in force, then it should not match non-ASCII
Unicode digits. But you didn't mention which version of grep you are
using, let alone which version of libc is providing your locale
definitions, to make that determination; and POSIX does not require RRI.
> meanwhile, with LC_ALL being C.UTF-8 this is not the case,
> LC_ALL=C.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"|wc -c
> consistently results in 1024 characters/bytes, as it's supposed to be...
Well, in the POSIX locale (C.UTF-8 is not the POSIX locale, but follows
enough of the same rules), [0-9] _is_ required to match the same as
[0123456789]. That's the only locale where you get RRI for free,
rather than having to worry if your choice of program version and locale
definition provide it.
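To compare locales without the randomness of /dev/urandom, a fixed sample can stand in for the input. The following is only a sketch: the helper name digit_match_bytes and the sample bytes (two ASCII digits plus U+0664, ARABIC-INDIC DIGIT FOUR, encoded as UTF-8) are assumptions made for illustration and do not come from the report. Whether any non-ASCII character actually matches [0-9] depends on the installed locale data and on the regex implementation in use.

```shell
# Hypothetical helper (not from the report): print how many bytes
# `grep -a -o PATTERN` emits for a fixed sample under the given locale.
# The sample holds the ASCII digits 1 and 7 plus U+0664 (ARABIC-INDIC
# DIGIT FOUR), written as octal escapes for its UTF-8 bytes D9 A4.
digit_match_bytes() {
    locale_name=$1
    pattern=$2
    printf 'a1b\331\244c7\n' |
        env LC_ALL="$locale_name" grep -a -o "$pattern" |
        tr -d '\n' |
        wc -c
}

# In the C locale only the two ASCII digits can match.
digit_match_bytes C '[0-9]'

# Assumes en_US.UTF-8 is installed; an unknown locale name usually
# falls back silently to C behavior instead of reporting an error.
digit_match_bytes en_US.UTF-8 '[0-9]'
digit_match_bytes en_US.UTF-8 '[[:digit:]]'
```

A count above 2 from the [0-9] call in the UTF-8 locale would indicate that the range matched the non-ASCII digit on that system; a count of 2 is what RRI would give.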
> it's not just en_US, it seems ANY utf-8 locale, other than C results
> in this bug, whereas non-utf8 versions are fine, bare en_US doesn't
> show this bug, nor does en_US.iso88591...
en_US.iso88591 does not have the problem because in that encoding, there
aren't any non-ASCII digits. So [0-9] will never match any non-ASCII
Unicode digits because the charset in use doesn't have such characters.
>
> worthy of note is that [[:digit:]] works correctly, while [0-9] does
> not (and 1-9 is same bug as 0-9, if you were wondering), setting -E
> doesn't change anything either...
POSIX requires [[:digit:]] to expand to the same 10 characters in ALL
locales, regardless of what the implementation does with [0-9], and
regardless of whether an implementation uses RRI. (This is true for
[[:digit:]], but not for other named ranges; for example, [[:alpha:]] is
still locale-dependent and may expand to more than 26 characters).
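When only the ten ASCII digits are wanted, the explanation above suggests spellings that should not depend on how a locale defines the 0-9 range: the [[:digit:]] class, an explicitly enumerated bracket expression, or the C locale. A minimal sketch combining the last two, with ascii_digits as a hypothetical helper name and a made-up sample input:

```shell
# Hypothetical helper: print each ASCII digit found on stdin, one per line.
# Forcing the C locale and spelling out the digits avoids relying on how
# a locale defines the 0-9 range.
ascii_digits() {
    env LC_ALL=C grep -a -o '[0123456789]'
}

# Sample input: "x4y", then U+0664 as UTF-8 (octal 331 244), then "z2".
printf 'x4y\331\244z2\n' | ascii_digits | tr -d '\n'
echo
```

Applied to the command from the report, the same idea would read: LC_ALL=C grep -a -o '[0123456789]' /dev/urandom | head -n 1024 | tr -d '\n' | wc -c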
Since the problem you reported is due to your locale, I'm closing this
as a non-bug. We may reopen it if additional details show that your
version of grep was supposed to be using RRI but failed to do so. And
feel free to continue conversation, even if we don't reopen the bug.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org
Information forwarded to bug-grep <at> gnu.org: bug#38503; Package grep. (Thu, 05 Dec 2019 20:41:01 GMT)
Message #15 received at 38503-done <at> debbugs.gnu.org (full text, mbox):
On 12/5/19 2:29 PM, Eric Blake wrote:
> tag 38503 notabug
> thanks
>
> On 12/5/19 12:30 PM, jan h wrote:
>> grep 3.3
>>
>
> Note that the Rational Range Interpretation of ranges claims that [0-9]
> should have the expansion [0123456789] in ALL locales; and more and more
> versions of GNU utilities are starting to move to RRI (even newer glibc
> is trying to move towards RRI for more regex operations). If this
> example is run where RRI is in force, then it should not match non-ASCII
> Unicode digits. But you didn't mention which version of grep you are
> using, let alone which version of libc is providing your locale
> definitions, to make that determination; and POSIX does not require RRI.
Sorry, I missed that you did mention grep 3.3. And the NEWS for grep
does not mention 'RRI' or 'Rational Range Interpretation' (compare that
to bash 4.2 introducing globasciiranges, or gawk introducing RRI in
4.0.1). So I'm not sure of the current state of whether grep tries to
use RRI on all systems or only on systems where it relies on gnulib's
regcomp instead of libc. So we may still need to reopen this if we
decide grep needs more RRI fixes.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org
Information forwarded to bug-grep <at> gnu.org: bug#38503; Package grep. (Thu, 05 Dec 2019 20:44:03 GMT)
Message #18 received at submit <at> debbugs.gnu.org (full text, mbox):
On another machine with grep 3.1 this does not appear to be the case,
so, regression?
On Thu, 5 December 2019 at 18:30, jan h (<jharald.j <at> gmail.com>) wrote:
>
> grep 3.3
>
> I get a few weird symbols (seems valid utf-8), along with normal
> numbers with the following simple snippet (.UTF-8 and .utf8 result in
> same, even .UtF---8 is the same):
> LC_ALL=en_US.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"
> wc -c counts 1047, 1033, 1036, etc., so they're multi-byte characters
> meanwhile, with LC_ALL being C.UTF-8 this is not the case,
> LC_ALL=C.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"|wc -c
> consistently results in 1024 characters/bytes, as it's supposed to be...
> it's not just en_US, it seems ANY utf-8 locale, other than C results
> in this bug, whereas non-utf8 versions are fine, bare en_US doesn't
> show this bug, nor does en_US.iso88591...
>
> worthy of note is that [[:digit:]] works correctly, while [0-9] does
> not (and 1-9 is same bug as 0-9, if you were wondering), setting -E
> doesn't change anything either...
Information forwarded to bug-grep <at> gnu.org: bug#38503; Package grep. (Thu, 05 Dec 2019 20:44:03 GMT)
Message #21 received at submit <at> debbugs.gnu.org (full text, mbox):
compiling from scratch resulted in a normal, working version
apparently Arch's package was somehow badly made?
Information forwarded to bug-grep <at> gnu.org: bug#38503; Package grep. (Thu, 05 Dec 2019 20:51:01 GMT)
Message #24 received at 38503 <at> debbugs.gnu.org (full text, mbox):
On 12/5/19 12:55 PM, jan h wrote:
> compiling from scratch resulted in a normal, working version
> apparently Arch's package was somehow badly made?
You also need to check whether your builds were using gnulib's regcomp
replacement, or sticking with the one from glibc; and in turn which
version of glibc is in use (as it was glibc 2.28 that tried to use RRI
in more locales, although work is still not complete there - and the
presence or absence of particular historical glibc regcomp bugs
determines whether configure decides to use gnulib's version instead).
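As a starting point for that check, the versions of grep and of the C library can be collected with standard commands. This is only a sketch: report_versions is a hypothetical helper name, and it does not reveal whether a given grep binary was built with gnulib's regcomp replacement, which is decided at configure time.

```shell
# Hypothetical helper: print the grep and C library versions in use.
report_versions() {
    # The first line of `grep --version` names the implementation and version.
    printf 'grep: %s\n' "$(grep --version 2>/dev/null | head -n 1)"
    # GNU_LIBC_VERSION is a glibc-specific getconf variable; on other
    # C libraries the lookup fails and a placeholder is printed instead.
    printf 'libc: %s\n' "$(getconf GNU_LIBC_VERSION 2>/dev/null || echo unknown)"
}

report_versions
```

Running this on both the packaged and the self-compiled setup would show whether the two differ in grep or glibc version.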
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org
Information forwarded to bug-grep <at> gnu.org: bug#38503; Package grep. (Thu, 05 Dec 2019 20:57:02 GMT)
Message #27 received at 38503-done <at> debbugs.gnu.org (full text, mbox):
On 12/5/19 12:40 PM, Eric Blake wrote:
> I'm not sure of the current state of whether grep tries to use RRI on
> all systems or only on systems where it relies on gnulib's regcomp
> instead of libc.
As I recall, grep doesn't make any special effort to use RRI. That is,
if the underlying library uses RRI, then grep does so as well; otherwise
it doesn't.
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 03 Jan 2020 12:24:04 GMT)
This bug report was last modified 5 years and 230 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.