GNU bug report logs - #11621
questionable locale sorting order (especially as related to char ranges in REs)


Package: coreutils;

Reported by: Linda Walsh <coreutils <at> tlinx.org>

Date: Sun, 3 Jun 2012 22:16:02 UTC

Severity: normal


From: "Linda A. Walsh" <lkml <at> tlinx.org>
To: 11621 <at> debbugs.gnu.org, Pádraig Brady <P <at> draigBrady.com>
Subject: bug#11621: questionable locale sorting order (especially as related to char ranges in REs)
Date: Wed, 06 Jun 2012 18:16:02 -0700
Pádraig Brady wrote:
> On 06/04/2012 06:03 AM, Linda A. Walsh wrote:
>   
>> Pádraig Brady wrote:
>>     
>>> On 06/03/2012 11:13 PM, Linda Walsh wrote:
>>>       
>>>> Within the past few years, use of ranges in RE's has become
>>>> unreliable due to some locale changes sorting their native character
>>>> sets such that a<A<b<B<y<Y<z<Z (vs. 'C' ordering A<B<Y<Z<a<b<y<z).
>>>>
>>>> There seems to be a problem in that when a user has set their system to use
>>>> Unicode, it is no longer using the locale-specific character set (iso-8859-x,
>>>> or others).
>>>>         
>> ----
>>     To clarify my above statement:
>>
>>
>>    There seems to be a problem in that when a user has set their system to use
>> Unicode, it is no longer using the locale-specific character set (iso-8859-x,
>> or others) -- ***or*** *their* *orderings*.  I.e. Unicode defines a collation
>> order -- I don't know that the others do ('C' does, but I don't know about
>> other locale-specific character sets).
>>
>>
>>     
>>> It's not specific to "unicode". Sorting in an iso-8859-1 charset
>>> results in locale ordering:
>>>       
>> ----
>>     Can you cite a source specifying the sort/collation order of the
>> iso-8859-1 charset that would prove that it is non-conforming to the
>> collation specification for that charset?
>>
>>     I.e. if there is no official source, then the order with that charset
>> is "undefined", and while it may not be desirable, returning a<A<b<B would
>> not be "an error".
>>     
>
> It's a charset. Of course the order is defined. Try: man iso-8859-1
>
> The relative ordering can be trivially inferred from the command I presented.
> But to be explicit:
>
> $ printf "%s\n" A b a á | iconv -t iso-8859-1 | LC_ALL=en_US [sic] sort | iconv -f iso-8859-1
> a
> A
> á
> b
>   
----
Your example doesn't show the collation order of iso-8859-1.  You are
setting it to 'en_US' (as LC_ALL overrides all other LC vars; LANG sets
the default, but individual settings in the LC variables can override it).
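
The precedence described above can be sketched in a few lines of shell. This is only an illustration of the POSIX lookup order for the collation setting (LC_ALL, then LC_COLLATE, then LANG); `pick_collate` is a hypothetical helper, and the fallback to "C" when nothing is set is an assumption, since POSIX leaves that default to the implementation.

```shell
# Hypothetical helper: report which setting decides collation, following
# the POSIX precedence LC_ALL > LC_COLLATE > LANG.
# Arguments: $1 = LC_ALL, $2 = LC_COLLATE, $3 = LANG (empty means unset).
pick_collate() {
    if [ -n "$1" ]; then
        echo "$1"
    elif [ -n "$2" ]; then
        echo "$2"
    elif [ -n "$3" ]; then
        echo "$3"
    else
        # Assumption: POSIX leaves this default implementation-defined.
        echo "C"
    fi
}

# As in the quoted command: LC_ALL is set, so it decides the ordering.
pick_collate "en_US" "" ""

# Without LC_ALL, a specific LC_COLLATE overrides the LANG default.
pick_collate "" "C" "en_US"

# Check the current environment.
pick_collate "${LC_ALL-}" "${LC_COLLATE-}" "${LANG-}"
```

The helper takes the three values as arguments rather than reading the environment, so it can be tried with arbitrary names without the shell itself attempting to switch locales.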

A corrected example:

$ (Charset=iso-8859-1; printf "%s\n" A b B a á | iconv -t $Charset |
    LANG=en_US LC_CHARSET=$Charset LC_COLLATE=$Charset sort |
    iconv -f $Charset | tr "\n" " "; echo "")
A B a b á

(I used 'Charset' to hold the charset name, added parens, printed them 
in the same orientation as input, and added a 2nd capital letter to make 
upper/lower case ordering clear.)
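
Whether a value passed in LC_COLLATE is accepted as a locale name on a given system can be checked against what `locale -a` reports. Note that LC_CHARSET is not among the categories POSIX defines (LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, LC_TIME). The sketch below assumes a system that provides the `locale` utility; `have_locale` is a hypothetical helper, and it only does an exact string match, so spelling variants (for example `en_US.UTF-8` versus `en_US.utf8`) may be reported as missing even when the C library would accept them.

```shell
# Hypothetical helper: succeed if the exact name appears in `locale -a`.
have_locale() {
    locale -a 2>/dev/null | grep -Fxq -- "$1"
}

# Report on a few candidate names, including the charset name used above.
for name in C en_US iso-8859-1; do
    if have_locale "$name"; then
        echo "$name: listed by locale -a"
    else
        echo "$name: not listed by locale -a"
    fi
done
```

A name that is not listed is, as far as is known for glibc-based tools, not applied at all, in which case the sort runs with the default C ordering.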

   I might note how "trivial" it was to arrive at incorrect output.
People often think me a pain because I ask them to explain what they
perceive to be obvious.  Unfortunately, what is obvious to one person may
not be so to another.

   The 'á' is not ASCII (the original charset for the C locale, coming from
Unix & the C programming language -- a reason why POSIX renamed the 'C'
locale to the POSIX locale).

   However, as 'á' is in the 1st 256 chars (above the ASCII range), it
can still work if you remove the iconv stuff (and note, I have no other
locale vars set):

$ echo ${!LC_*} ${!LAN*}
LC_COLLATE LC_CTYPE

$ (Charset=ASCII; printf "%s\n" A B b a á |
    LC_CHARSET=$Charset LC_COLLATE=$Charset sort | tr "\n" " "; echo "")
A B a b á

   To bring this to completion -- most Linux systems today use the UTF-8
character set.  It shows an *identical* collation order for the above 
chars as the iso-8859-1 charset.
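
The ordering shown above matches plain byte-value comparison, which is what the C (POSIX) locale specifies. A minimal sketch, assuming a POSIX shell with `sort` and `paste`; `c_sort_line` is a hypothetical helper, and the non-ASCII character is written as octal escapes so the example does not depend on the encoding of the terminal or file.

```shell
# Hypothetical helper: sort stdin by byte value (C locale) and join the
# result on one line, separated by spaces.
c_sort_line() {
    LC_ALL=C sort | paste -s -d ' ' -
}

# \303\241 is the UTF-8 encoding of 'á'.  Upper case sorts before lower
# case, and the non-ASCII bytes sort after all of ASCII.
printf 'A\nB\nb\na\n\303\241\n' | c_sort_line

# \341 is 'á' in iso-8859-1; the relative order is the same, because that
# single byte is also above the ASCII range.  od makes the byte visible.
printf 'A\nB\nb\na\n\341\n' | c_sort_line | od -c
```

In both encodings the bytes for 'á' are greater than 0x7F, which is why the two charsets give the same relative order for these particular characters under byte comparison.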

   It appears that the collating functions are confused by the notation
that has been adopted in many distributions, namely <locale>.charset.
In such a notation the charset has been explicitly specified, and the
charset has explicit collation and case-folding rules (those for Unicode
are extensive and handle accents as well as other forms like
ſȘșʂȿᵴᶊṠṡṢṣṤṥṦṧṨṩẛẜẝẞⱾꞨꞩSsߌśŜŝŞşŠšˢ...etc.).

   Therefore, I would like to see the character set's collation and 
folding rules used where they are officially specified (as in the case 
of Unicode or POSIX).

   Are you the person responsible for the libicuXXX files?




This bug report was last modified 13 years and 66 days ago.
