GNU bug report logs - #33837
Unexpected result for regex with non-ascii range


Package: grep;

Reported by: Reinis Danne <rei4dan <at> gmail.com>

Date: Sat, 22 Dec 2018 21:34:02 UTC

Severity: normal

Tags: notabug

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.


Message #5 received at submit <at> debbugs.gnu.org:

From: Reinis Danne <rei4dan <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: Unexpected result for regex with non-ascii range
Date: Sat, 22 Dec 2018 21:43:46 +0200

Hi!

grep-3.3 and sed-4.6 seem to have fixed the issue with incorrect collation
of yY in the lv_LV.UTF-8 locale (by implementing rational range
interpretation?) [1].

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=23774

However, the ranges [A-Ž] and [a-ž] still give unexpected results:
$ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[A-Ž]*'
aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZ
Ž
$ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[a-ž]*'
a
āĀb
c
čČd
e
ēĒf
g
ģĢh
i
īĪy
j
k
ķĶl
ļĻm
n
ņŅo
ōŌp
q
r
ŗŖs
šŠt
u
ūŪv
w
x
z
žŽ

For the uppercase range the result is completely bogus, whereas for the
lowercase range the accented uppercase letters appear interleaved with the
lowercase ones.

I would expect all letters to have their uppercase variants de-interleaved here.
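
One way to test the "rational range interpretation" guess from the start of this message is to compare the grep output with plain code-point ranges. The following is a minimal sketch, not grep's implementation: it assumes that a bracket range [x-y] matches every character whose Unicode code point lies between x and y, which is how Python's re module treats ranges. The helper name codepoint_range_matches is hypothetical, and `+` is used instead of `*` only to skip the empty matches that grep -o does not print.

```python
import re
import unicodedata

# Test string from the report, normalized so every letter is one code point.
TEXT = unicodedata.normalize(
    "NFC",
    "aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ",
)

def codepoint_range_matches(text, first, last):
    """Return maximal runs of characters with code points in [first, last]."""
    pattern = "[{}-{}]+".format(re.escape(first), re.escape(last))
    return re.findall(pattern, text)

# Assumption: ranges are compared by code point, not by locale collation.
upper = codepoint_range_matches(TEXT, "A", "\u017d")  # [A-Ž], U+0041..U+017D
lower = codepoint_range_matches(TEXT, "a", "\u017e")  # [a-ž], U+0061..U+017E

print("\n".join(upper))
print("\n".join(lower))
```

Under that assumption both lists coincide with the grep output quoted above: [A-Ž] covers every character of the test string except ž (U+017E), while [a-ž] covers the accented letters of both cases but skips the unaccented uppercase letters A-Z (U+0041..U+005A). This suggests, but does not by itself prove, that the ranges are being evaluated by code point rather than by lv_LV collation order.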

I don't know whether grep alters the collation rules or whether glibc (2.28) does.
strxfrm() gives me this result:
Using LC_COLLATE=lv_LV.UTF-8
char    strxfrm
i    c2b7010201020101e29b96
I    c2b7010201070101e2afb7
ī    c2b70102140102020101e29bb7
Ī    c2b70102140107020101e2b096
y    c2b701030102
Y    c2b701030107
j    c382010201020101e29c96
J    c382010201070101e2b0a4
Using LC_COLLATE=C.UTF-8
char    strxfrm
i    6b
I    4b
ī    c4ad
Ī    c4ac
y    7b
Y    5b
j    6c
J    4c
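
A table like the one above could be produced with a few lines of code. The sketch below is not the program used for this report; it assumes a POSIX system, calls the C library's strxfrm() through ctypes, and uses the hypothetical helper names strxfrm_hex and collation_order. The second helper orders characters by comparing the sort keys byte by byte, which is what strcmp() does on strxfrm() output.

```python
import ctypes
import locale

# Assumption: a POSIX system where the C library is already loaded.
libc = ctypes.CDLL(None)
libc.strxfrm.restype = ctypes.c_size_t
libc.strxfrm.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_size_t]

def strxfrm_hex(text, collate_locale):
    """Hex dump of the C strxfrm() sort key for text under collate_locale."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        # Raises locale.Error if the requested locale is not installed.
        locale.setlocale(locale.LC_COLLATE, collate_locale)
        src = text.encode("utf-8")
        needed = libc.strxfrm(None, src, 0)  # size query, nothing is written
        buf = ctypes.create_string_buffer(needed + 1)
        libc.strxfrm(buf, src, needed + 1)
        return buf.raw[:needed].hex()
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)

# Keys copied from the lv_LV.UTF-8 table in this message.
LV_KEYS = {
    "i": "c2b7010201020101e29b96",
    "I": "c2b7010201070101e2afb7",
    "ī": "c2b70102140102020101e29bb7",
    "Ī": "c2b70102140107020101e2b096",
    "y": "c2b701030102",
    "Y": "c2b701030107",
    "j": "c382010201020101e29c96",
    "J": "c382010201070101e2b0a4",
}

def collation_order(keys):
    """Order characters by comparing their sort keys byte by byte."""
    return sorted(keys, key=lambda ch: bytes.fromhex(keys[ch]))

print(" ".join(collation_order(LV_KEYS)))
```

Sorting the lv_LV.UTF-8 keys from the table this way gives i I ī Ī y Y j J, so y and Y fall between Ī and j. Calling strxfrm_hex("i", "lv_LV.UTF-8") requires that locale to be installed, and the bytes it returns may differ between glibc versions.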


Reinis

This bug report was last modified 5 years and 136 days ago.


