GNU bug report logs - #38503
Locale can cause incorrect number parsing in binary files

Previous Next

Package: grep;

Reported by: jan h <jharald.j <at> gmail.com>

Date: Thu, 5 Dec 2019 20:02:01 UTC

Severity: normal

Tags: notabug

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.

Full log


View this message in rfc822 format

From: jan h <jharald.j <at> gmail.com>
To: 38503 <at> debbugs.gnu.org
Subject: bug#38503: Locale can cause incorrect number parsing in binary files
Date: Thu, 5 Dec 2019 18:30:58 +0000
grep 3.3

I get a few weird symbols (seems valid utf-8), along with normal
numbers with the following simple snippet (.UTF-8 and .utf8 result in
same, even .UtF---8 is the same):
LC_ALL=en_US.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"
wc -c counts 1047 and and 1033 and 1036 etc, so they're multi-byte characters
meanwhile, with LC_ALL being C.UTF-8 this is not the case,
LC_ALL=C.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"|wc -c
consistently results in 1024 characters/bytes, as it's supposed to be...
it's not just en_US, it seems ANY utf-8 locale, other than C results
in this bug, whereas non-utf8 versions are fine, bare en_US doesn't
show this bug, nor does en_US.iso88591...

worthy of note is that [[:digit:]] works correctly, while [0-9] does
not (and 1-9 is same bug as 0-9, if you were wondering), setting -E
doesn't change anything either...




This bug report was last modified 5 years and 230 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.