GNU bug report logs -
#33837
Unexpected result for regex with non-ascii range
Previous Next
Reported by: Reinis Danne <rei4dan <at> gmail.com>
Date: Sat, 22 Dec 2018 21:34:02 UTC
Severity: normal
Tags: notabug
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
Full log
View this message in rfc822 format
tags 33873 notabug
close 33873
stop
On Sat, Dec 22, 2018 at 1:34 PM Reinis Danne <rei4dan <at> gmail.com> wrote:
> grep-3.3 and sed-4.6 seem to have fixed issue with incorrect collation
> of yY for lv_LV.UTF-8 locale (by implementing rational range
> interpretation?) [1].
>
> [1] https://sourceware.org/bugzilla/show_bug.cgi?id=23774
>
> However, it seems that for ranges [a-ž] and [A-Ž] there are unexpected results:
> $ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ
> | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[A-Ž]*'
> aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZ
> Ž
> $ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ
> | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[a-ž]*'
> a
> āĀb
> c
> čČd
...
>
> For the uppercase the result is completely bogus, but for the lowercase range
> it seems that accented uppercase letters are interleaved with the
> lowercase ones.
>
> I would expect all letters to have their uppercase variants de-interleaved here.
>
> I don't know if grep alters the collation rules or it is done by glibc (2.28).
> strxfrm() gives me this result:
> Using LC_COLLATE=lv_LV.UTF-8
> char strxfrm
> i c2b7010201020101e29b96
> I c2b7010201070101e2afb7
...
Thanks for the report. However, ...
Using a multi-byte character as a range endpoint elicits what the
standards documents call "unspecified behavior".
Quoting grep's own manual,
> Within a bracket expression, a "range expression" consists of two characters separated by a hyphen. It matches any single character that sorts between the two characters, inclusive. In the default C locale, the sorting sequence is the native character order; for example, '[a-d]' is equivalent to '[abcd]'. In other locales, the sorting sequence is not specified, and '[a-d]' might be equivalent to '[abcd]' or to '[aBbCcDd]', or it might fail to match any character, or the set of characters that it matches might even be erratic. To obtain the traditional interpretation of bracket expressions, you can use the 'C' locale by setting the 'LC_ALL' environment variable to the value 'C'.
For the record, POSIX says this:
http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html:
> Range expressions are, historically, an integral part of REs. However, the requirements of "natural language behavior" and portability do conflict. In the POSIX locale, ranges must be treated according to the collating sequence and include such characters that fall within the range based on that collating sequence, regardless of character values. In other locales, ranges have unspecified behavior.
I am marking the auto-created issue as "not-a-bug", and can't even
(reasonably) label it as "wishlist", because allowing what your usage
implies is fundamentally contradictory.
You're welcome to continue the discussion here.
This bug report was last modified 5 years and 136 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.