#33837 - Unexpected result for regex with non-ascii range

GNU bug report logs - #33837
Unexpected result for regex with non-ascii range

Package: grep

Reported by: Reinis Danne <rei4dan <at> gmail.com>

Date: Sat, 22 Dec 2018 21:34:02 UTC

Severity: normal

Tags: notabug

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.


From: Jim Meyering <jim <at> meyering.net>
To: rei4dan <at> gmail.com
Cc: 33837 <at> debbugs.gnu.org
Subject: bug#33837: Unexpected result for regex with non-ascii range
Date: Sun, 23 Dec 2018 12:17:52 -0800

tags 33873 notabug
close 33873
stop

On Sat, Dec 22, 2018 at 1:34 PM Reinis Danne <rei4dan <at> gmail.com> wrote:
> grep-3.3 and sed-4.6 seem to have fixed issue with incorrect collation
> of yY for lv_LV.UTF-8 locale (by implementing rational range
> interpretation?) [1].
>
> [1] https://sourceware.org/bugzilla/show_bug.cgi?id=23774
>
> However, it seems that for ranges [a-ž] and [A-Ž] there are unexpected results:
> $ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ
> | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[A-Ž]*'
> aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZ
> Ž
> $ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ
> | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[a-ž]*'
> a
> āĀb
> c
> čČd
...
>
> For the uppercase the result is completely bogus, but for the lowercase range
> it seems that accented uppercase letters are interleaved with the
> lowercase ones.
>
> I would expect all letters to have their uppercase variants de-interleaved here.
>
> I don't know if grep alters the collation rules or it is done by glibc (2.28).
> strxfrm() gives me this result:
> Using LC_COLLATE=lv_LV.UTF-8
> char strxfrm
> i c2b7010201020101e29b96
> I c2b7010201070101e2afb7
...

Thanks for the report. However, ...
Using a multi-byte character as a range endpoint elicits what the
standards documents call "unspecified behavior".
Quoting grep's own manual,

> Within a bracket expression, a "range expression" consists of two
> characters separated by a hyphen. It matches any single character that
> sorts between the two characters, inclusive. In the default C locale,
> the sorting sequence is the native character order; for example,
> '[a-d]' is equivalent to '[abcd]'. In other locales, the sorting
> sequence is not specified, and '[a-d]' might be equivalent to '[abcd]'
> or to '[aBbCcDd]', or it might fail to match any character, or the set
> of characters that it matches might even be erratic.
>
> To obtain the traditional interpretation of bracket expressions, you
> can use the 'C' locale by setting the 'LC_ALL' environment variable to
> the value 'C'.

For the record, POSIX says this:
http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html:

> Range expressions are, historically, an integral part of REs. However,
> the requirements of "natural language behavior" and portability do
> conflict. In the POSIX locale, ranges must be treated according to the
> collating sequence and include such characters that fall within the
> range based on that collating sequence, regardless of character
> values. In other locales, ranges have unspecified behavior.

I am marking the auto-created issue as "not-a-bug", and can't even
(reasonably) label it as "wishlist", because allowing what your usage
implies is fundamentally contradictory.

You're welcome to continue the discussion here.
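
The output shown in the report is consistent with the range endpoints being compared by Unicode code point rather than by Latvian collation order. The following is a minimal sketch of that interpretation, assuming code point comparison (which is how Python's `re` module treats ranges in `str` patterns). It does not call grep or glibc, so it illustrates one possible reading of the observed output and is not a statement about grep's internals. The names `TEXT`, `runs`, `upper_runs`, and `lower_runs` are introduced here purely for illustration.

```python
import re
import unicodedata

# The test string from the report, normalized to NFC so that every
# accented letter is a single precomposed code point.
TEXT = unicodedata.normalize(
    "NFC",
    "aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ",
)

# Range endpoints written as escapes: U+017D is Ž, U+017E is ž.
# Python compares range endpoints by code point, independent of locale.
UPPER_RANGE = re.compile("[A-\u017d]+")  # code points U+0041..U+017D
LOWER_RANGE = re.compile("[a-\u017e]+")  # code points U+0061..U+017E


def runs(pattern, text):
    """Return the non-empty matches in order, similar to what `grep -o` prints."""
    return pattern.findall(text)


upper_runs = runs(UPPER_RANGE, TEXT)
lower_runs = runs(LOWER_RANGE, TEXT)

if __name__ == "__main__":
    print(upper_runs)      # two runs: everything up to "zZ", then "Ž"
    print(lower_runs[:4])  # ['a', 'āĀb', 'c', 'čČd']
```

Under this reading, `[A-Ž]` excludes only `ž` (U+017E), which is the single letter in the string above `Ž` (U+017D), and `[a-ž]` excludes only the unaccented uppercase letters, whose code points (U+0041 to U+005A) lie below `a` (U+0061). Both results match the runs quoted in the report.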

This bug report was last modified 5 years and 199 days ago.

