#11621 - questionable locale sorting order (especially as related to char ranges in REs)

GNU bug report logs - #11621
questionable locale sorting order (especially as related to char ranges in REs)

Reported by: Linda Walsh <coreutils <at> tlinx.org>

Date: Sun, 3 Jun 2012 22:16:02 UTC

Severity: normal


From: "Linda A. Walsh" <lkml <at> tlinx.org>
To: 11621 <at> debbugs.gnu.org, Pádraig Brady <P <at> draigBrady.com>
Subject: bug#11621: questionable locale sorting order (especially as related to char ranges in REs)
Date: Wed, 06 Jun 2012 18:16:02 -0700


Pádraig Brady wrote:
> On 06/04/2012 06:03 AM, Linda A. Walsh wrote:
>
>> Pádraig Brady wrote:
>>
>>> On 06/03/2012 11:13 PM, Linda Walsh wrote:
>>>
>>>> Within in the past few years, use of ranges in RE's has become
>>>> unreliable due to some locale changes sorting their native character
>>>> sets such that a<A<b<B<y<Y<z<Z (vs. 'C' ordering A<B<Y<Z<a<b<y<z).
>>>>
>>>> There seems to be a problem in when a user has set their system to use
>>>> Unicode, it is no longer using the locale specific character set (iso-8859-x,
>>>> or others).
>>>>
>> ----
>> To clarify my above statement:
>>
>>
>> There seems to be a problem in when a user has set their system to use
>> Unicode: It is no longer using the locale specific character set (iso-8859-x,
>> or others) -- ***or*** *their* *orderings*. I.e. Unicode defines a collation
>> order -- I don't know that they others do ('C' does, but I don't know about
>> other locale-specific character sets).
>>
>>
>>
>>> It's not specific to "unicode". Sorting in a iso-8859-1 charset
>>> results in locale ordering:
>>>
>> ----
>> Can you cite a source specifying the sort/collation order of the
>> iso-8859-1 charset that would prove that it is not-conforming to the collation specification for that charset?
>>
>> I.e. If there is no official source, then the order with that charset
>> is "undefined", and while it may not be desirable, returning a<A<b<B, would not be "an error".
>>
>
> It's a charset. Of course the order is defined. Try: man iso-8859-1
>
> The relative ordering can be trivially inferred from the command I presented.
> But to be explicit:
>
> $ printf "%s\n" A b a á | iconv -t iso-8859-1 | LC_ALL=en_US [sic] sort | iconv -f iso-8859-1
> a
> A
> á
> b
>
----
Your example doesn't show the collation order of iso-8859-1. You are setting it to 'en_US' (as LC_ALL overrides all other LC vars; LANG sets the default, but individual settings in the LC variables can override it).
A corrected example:

$ (Charset=iso-8859-1; printf "%s\n" A b B a á | iconv -t $Charset | LANG=en_US LC_CHARSET=$Charset LC_COLLATE=$Charset sort | iconv -f $Charset |tr "\n" " ";echo "")
A B a b á

(I used 'Charset' to hold the charset name, added parens, printed them in the same orientation as input, and added a 2nd capital letter to make upper/lower case ordering clear.)

I might note how "trivial" it was to arrive at incorrect output. People often think me a pain because I ask them to explain what they perceive to be obvious. Unfortunately, what is obvious to 1 person may not be so to another.

The 'á' is not ASCII (original charset for C locale, coming from unix & C programming language -- a reason why POSIX renamed the 'C' locale to the POSIX locale). However, as 'á' is in the 1st 256 chars (above the ASCII range), it can still work if you remove the iconv stuff (and note, I have no other locale vars set):

$ echo ${!LC_*} ${!LAN*}
LC_COLLATE LC_CTYPE
$ (Charset=ASCII; printf "%s\n" A B b a á | LC_CHARSET=$Charset LC_COLLATE=$Charset sort |tr "\n" " ";echo "")
A B a b á

To bring this to completion -- most linux systems today use the UTF-8 character set. It shows an *identical* collation order for the above chars as the iso-8859-1 charset.

It appears that the collating functions are confused by the notation that has been adopted in many distributions...namely <locale>.charset. In such a notation, where the charset has been explicitly specified, and where the charset has explicit COLLATION and case folding rules (those for Unicode are extensive and handle accents as well as other forms like ſȘșʂȿᵴᶊṠṡṢṣṤṥṦṧṨṩẛẜẝẞⱾꞨꞩSsßŚśŜŝŞşŠšˢ...etc.). Therefore, I would like to see the character set's collation and folding rules used where they are officially specified (as in the case of Unicode or POSIX).

Are you the person responsible for the libicuXXX files?
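
For reference, the 'C' ordering and the LC_ALL precedence discussed in this message can be reproduced with a small sketch that does not depend on which locales are installed. The function names (c_sort, precedence_demo, c_range) are made up for this illustration, and en_US.UTF-8 is only a placeholder value: it is overridden by LC_ALL=C and never actually loaded. Only ASCII letters are used, so the result should not depend on the terminal encoding.

```shell
# Sort four ASCII letters by byte value; LC_ALL=C forces the C/POSIX locale.
c_sort() {
    printf '%s\n' A b B a | LC_ALL=C sort | tr '\n' ' '
}

# LC_ALL takes precedence over LC_COLLATE and LANG, so the placeholder
# values given for those two variables have no effect on the ordering.
precedence_demo() {
    printf '%s\n' A b B a |
        env LANG=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_ALL=C sort |
        tr '\n' ' '
}

# In the C locale the range [a-z] matches lowercase ASCII letters only.
c_range() {
    printf '%s\n' A b B a | LC_ALL=C grep '[a-z]' | tr '\n' ' '
}

c_sort; echo
precedence_demo; echo
c_range; echo
```

Replacing LC_ALL=C with an installed language locale is the step at which the a<A<b<B ordering described in the report may appear; that outcome depends on the system's locale data, so it is not shown here.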


This bug report was last modified 13 years and 66 days ago.

