GNU bug report logs -
#11621
questionable locale sorting order (especially as related to char ranges in REs)
Pádraig Brady wrote:
> On 06/04/2012 06:03 AM, Linda A. Walsh wrote:
>
>> Pádraig Brady wrote:
>>
>>> On 06/03/2012 11:13 PM, Linda Walsh wrote:
>>>
>>>> Within the past few years, use of ranges in RE's has become
>>>> unreliable due to some locale changes sorting their native character
>>>> sets such that a<A<b<B<y<Y<z<Z (vs. 'C' ordering A<B<Y<Z<a<b<y<z).
>>>>
>>>> There seems to be a problem when a user has set their system to use
>>>> Unicode, it is no longer using the locale specific character set (iso-8859-x,
>>>> or others).
>>>>
>> ----
>> To clarify my above statement:
>>
>>
>> There seems to be a problem when a user has set their system to use
>> Unicode: It is no longer using the locale specific character set (iso-8859-x,
>> or others) -- ***or*** *their* *orderings*. I.e. Unicode defines a collation
>> order -- I don't know that the others do ('C' does, but I don't know about
>> other locale-specific character sets).
>>
>>
>>
>>> It's not specific to "unicode". Sorting in an iso-8859-1 charset
>>> results in locale ordering:
>>>
>> ----
>> Can you cite a source specifying the sort/collation order of the
>> iso-8859-1 charset that would prove that it is non-conforming to the
>> collation specification for that charset?
>>
>> I.e., if there is no official source, then the order with that charset
>> is "undefined", and while it may not be desirable, returning a<A<b<B
>> would not be "an error".
>>
>
> It's a charset. Of course the order is defined. Try: man iso-8859-1
>
> The relative ordering can be trivially inferred from the command I presented.
> But to be explicit:
>
> $ printf "%s\n" A b a á | iconv -t iso-8859-1 | LC_ALL=en_US [sic] sort | iconv -f iso-8859-1
> a
> A
> á
> b
>
----
Your example doesn't show the collation order of iso-8859-1. You are
setting it to 'en_US' (as LC_ALL overrides all other LC vars; LANG sets
the default, but individual settings in the LC variables can override it).
A corrected example:
$ (Charset=iso-8859-1; printf "%s\n" A b B a á | iconv -t $Charset |
LANG=en_US LC_CHARSET=$Charset LC_COLLATE=$Charset sort |
iconv -f $Charset | tr "\n" " "; echo "")
A B a b á
(I used 'Charset' to hold the charset name, added parens, printed them
in the same orientation as input, and added a 2nd capital letter to make
upper/lower case ordering clear.)
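The precedence described above (LC_ALL over the individual LC_* variables, which in turn override LANG) can be checked with a small sketch. It assumes only that the C locale is available; the en_US.UTF-8 name is just a placeholder for whatever LANG a system sets and need not be installed for the result to hold. The variable and function names are made up for illustration.

```shell
# Mixed-case input; in the C locale uppercase sorts before lowercase.
input() { printf '%s\n' b A a B; }

# LC_ALL wins over both LC_COLLATE and LANG.
with_lc_all=$(input | LANG=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_ALL=C sort | tr '\n' ' ')

# With LC_ALL unset, LC_COLLATE wins over LANG.
with_lc_collate=$(unset LC_ALL; input | LANG=en_US.UTF-8 LC_COLLATE=C sort | tr '\n' ' ')

echo "$with_lc_all"
echo "$with_lc_collate"
```

Both lines should show the C ordering, since the collation category resolves to C in each case.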
I might note how "trivial" it was to arrive at incorrect output.
People often think me a pain because I ask them to explain what they
perceive to be obvious. Unfortunately, what is obvious to one person
may not be so to another.
The 'á' is not ASCII (the original charset for the C locale, coming from
Unix and the C programming language -- a reason why POSIX renamed the 'C'
locale to the POSIX locale).
However, as 'á' is in the 1st 256 chars (above the ASCII range), it
can still work if you remove the iconv stuff (and note, I have no other
locale vars set):
$ echo ${!LC_*} ${!LAN*}
LC_COLLATE LC_CTYPE
$ (Charset=ASCII; printf "%s\n" A B b a á |
LC_CHARSET=$Charset LC_COLLATE=$Charset sort | tr "\n" " "; echo "")
A B a b á
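For comparison, here is a sketch of what plain byte-wise collation does with the same input. It forces the C locale with LC_ALL=C and assumes the script or terminal is UTF-8 encoded, so 'á' is the two bytes c3 a1; any byte above 0x7F compares greater than every ASCII letter. The variable names are illustrative only.

```shell
# Bytes of 'á' in a UTF-8 encoded script.
a_acute_bytes=$(printf '%s' 'á' | od -An -tx1 | tr -d ' \n')

# Byte-wise (C locale) sort: uppercase, then lowercase, then non-ASCII.
c_sorted=$(printf '%s\n' A B b a á | LC_ALL=C sort | tr '\n' ' ')

echo "$a_acute_bytes"
echo "$c_sorted"
```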
To bring this to completion -- most Linux systems today use the UTF-8
character set. It shows an *identical* collation order for the above
chars as the iso-8859-1 charset.
It appears that the collating functions are confused by the notation
that has been adopted in many distributions, namely <locale>.charset.
In such a notation, the charset has been explicitly specified, and
the charset has explicit COLLATION and case folding rules (those
for Unicode are extensive and handle accents as well as other forms like
ſȘșʂȿᵴᶊṠṡṢṣṤṥṦṧṨṩẛẜẝẞⱾꞨꞩSsߌśŜŝŞşŠšˢ...etc.).
Therefore, I would like to see the character set's collation and
folding rules used where they are officially specified (as in the case
of Unicode or POSIX).
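The range problem that opened this report can be sidestepped, whatever the locale's collation turns out to be, by pinning the C locale for the command or by using a POSIX character class instead of a range. A minimal sketch, assuming a POSIX grep (variable names are illustrative):

```shell
# In the C locale, [a-y] covers only the lowercase bytes a..y.
range_c=$(printf '%s\n' a B y Z | LC_ALL=C grep '^[a-y]$' | tr '\n' ' ')

# [[:lower:]] names the class directly and does not depend on collation order.
class_lower=$(printf '%s\n' a B y Z | grep '^[[:lower:]]$' | tr '\n' ' ')

echo "$range_c"
echo "$class_lower"
```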
Are you the person responsible for the libicuXXX files?
This bug report was last modified 13 years and 66 days ago.