GNU bug report logs - #38503
Locale can cause incorrect number parsing in binary files
Reported by: jan h <jharald.j <at> gmail.com>
Date: Thu, 5 Dec 2019 20:02:01 UTC
Severity: normal
Tags: notabug
Done: Eric Blake <eblake <at> redhat.com>
Bug is archived. No further changes may be made.
On another machine with grep 3.1 this does not appear to be the case,
so, regression?
On Thu, 5 December 2019 at 18:30, jan h (<jharald.j <at> gmail.com>) wrote:
>
> grep 3.3
>
> I get a few weird symbols (seems valid utf-8), along with normal
> numbers with the following simple snippet (.UTF-8 and .utf8 result in
> same, even .UtF---8 is the same):
> LC_ALL=en_US.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"
> wc -c counts 1047, 1033, 1036, etc., so they're multi-byte characters
> meanwhile, with LC_ALL being C.UTF-8 this is not the case,
> LC_ALL=C.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"|wc -c
> consistently results in 1024 characters/bytes, as it's supposed to be...
> it's not just en_US, it seems ANY utf-8 locale, other than C results
> in this bug, whereas non-utf8 versions are fine, bare en_US doesn't
> show this bug, nor does en_US.iso88591...
>
> worthy of note is that [[:digit:]] works correctly, while [0-9] does
> not (and 1-9 is same bug as 0-9, if you were wondering), setting -E
> doesn't change anything either...
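
The quoted reproduction reads from /dev/urandom, so its byte counts change on every run. A minimal sketch of a deterministic variant follows, assuming a POSIX shell and GNU grep. The variable names and the choice of U+00B2 as a sample non-ASCII character are illustrative and not taken from the report. It only establishes the C-locale baseline, where [0-9] covers the ASCII digits. Whether the same pattern also matches other characters under en_US.UTF-8 depends on the installed locale data, so that part is left for the reader to run and no output is claimed for it.

```shell
# Deterministic stand-in for /dev/urandom: three one-character lines.
#   7          ASCII digit (1 byte)
#   \302\262   U+00B2 SUPERSCRIPT TWO, an arbitrary non-ASCII sample (2 bytes in UTF-8)
#   x          ASCII letter
sample=$(printf '7\n\302\262\nx\n')

# C-locale baseline: [0-9] covers only the ASCII digits.
c_matches=$(printf '%s\n' "$sample" | LC_ALL=C grep -a -o '[0-9]' | tr -d '\n')
c_bytes=$(printf '%s' "$c_matches" | wc -c | tr -d ' ')
echo "C locale: matched '$c_matches' ($c_bytes byte(s))"

# Same pattern under a UTF-8 locale, only if one is installed.
# No expected output is given: the result depends on the locale data.
if locale -a 2>/dev/null | grep -qi '^en_US\.utf-\{0,1\}8$'; then
    printf '%s\n' "$sample" | LC_ALL=en_US.UTF-8 grep -a -o '[0-9]' | tr -d '\n' | wc -c
fi
```

If the last command prints a count larger than the C-locale one, the extra bytes come from a multi-byte character that the range accepted in that locale, which is the effect the report describes. POSIX leaves range expressions such as [0-9] unspecified outside the C/POSIX locale, which would be consistent with the notabug tag above.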
This bug report was last modified 5 years and 230 days ago.