GNU bug report logs - #33837
Unexpected result for regex with non-ascii range
Reported by: Reinis Danne <rei4dan <at> gmail.com>
Date: Sat, 22 Dec 2018 21:34:02 UTC
Severity: normal
Tags: notabug
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
Message #5 received at submit <at> debbugs.gnu.org:
Hi!
grep-3.3 and sed-4.6 seem to have fixed the issue with incorrect collation
of yY in the lv_LV.UTF-8 locale (by implementing rational range
interpretation?) [1].
[1] https://sourceware.org/bugzilla/show_bug.cgi?id=23774
However, the ranges [a-ž] and [A-Ž] still give unexpected results:
$ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ \
    | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[A-Ž]*'
aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZ
Ž
$ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ \
    | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[a-ž]*'
a
āĀb
c
čČd
e
ēĒf
g
ģĢh
i
īĪy
j
k
ķĶl
ļĻm
n
ņŅo
ōŌp
q
r
ŗŖs
šŠt
u
ūŪv
w
x
z
žŽ
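For the uppercase range the result is completely bogus: it matches every
letter, lowercase included, except ž. For the lowercase range, the accented
uppercase letters are matched along with the lowercase ones, while the
unaccented uppercase letters are not.
I would expect uppercase and lowercase letters to be separated cleanly here,
with each range matching only letters of its own case.
I don't know whether grep alters the collation rules or whether this is done
by glibc (2.28).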
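Both outputs above are consistent with the ranges being matched by Unicode code point rather than by Latvian collation order. That reading is an assumption, not something this report confirms about grep. A minimal Python sketch of it follows; Python's re module compares code points in character ranges, and the names SAMPLE and range_runs are made up for illustration:

```python
import re

# The test string from the report (precomposed characters).
SAMPLE = ("aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌ"
          "pPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ")


def range_runs(first, last, text=SAMPLE):
    """Return maximal runs of characters whose code points lie in [first, last].

    Uses '+' instead of '*' so that empty matches are skipped, which
    mirrors what `grep -o` prints.
    """
    return re.findall(f"[{first}-{last}]+", text)


for first, last in (("A", "Ž"), ("a", "ž")):
    # Show the code point interval each range covers under this assumption.
    print(f"[{first}-{last}] covers U+{ord(first):04X}..U+{ord(last):04X}")
    print("\n".join(range_runs(first, last)))
```

Under this assumption [A-Ž] spans U+0041..U+017D, which takes in every letter of the test string except ž (U+017E), and [a-ž] spans U+0061..U+017E, which leaves out only the ASCII uppercase letters.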
strxfrm() gives me this result:
Using LC_COLLATE=lv_LV.UTF-8
char strxfrm
i c2b7010201020101e29b96
I c2b7010201070101e2afb7
ī c2b70102140102020101e29bb7
Ī c2b70102140107020101e2b096
y c2b701030102
Y c2b701030107
j c382010201020101e29c96
J c382010201070101e2b0a4
Using LC_COLLATE=C.UTF-8
char strxfrm
i 6b
I 4b
ī c4ad
Ī c4ac
y 7b
Y 5b
j 6c
J 4c
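To compare collation order with plain code point order for the same characters, something along these lines could be used. It is only a sketch: it assumes the lv_LV.UTF-8 and C.UTF-8 locales are installed (and reports when they are not), and collation_order is a hypothetical helper name. Python's locale.strxfrm returns a str sort key rather than the raw bytes dumped above, so the key values themselves will not match the hex in the tables.

```python
import locale

CHARS = "iIīĪyYjJ"


def collation_order(chars, locale_name):
    """Return chars sorted by their strxfrm key under locale_name.

    Returns None if the locale is not installed or the transform fails.
    The previous LC_COLLATE setting is restored afterwards.
    """
    saved = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return sorted(chars, key=locale.strxfrm)
    except OSError:
        return None
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)


print("code point order:", "".join(sorted(CHARS)))
for name in ("lv_LV.UTF-8", "C.UTF-8"):
    order = collation_order(CHARS, name)
    print(name + ":", "".join(order) if order is not None else "(not available)")
```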
Reinis
This bug report was last modified 5 years and 136 days ago.