GNU bug report logs - #11621
questionable locale sorting order (especially as related to char ranges in REs)

Package: coreutils;

Reported by: Linda Walsh <coreutils <at> tlinx.org>

Date: Sun, 3 Jun 2012 22:16:02 UTC

Severity: normal

From: Pádraig Brady <P <at> draigBrady.com>
To: "Linda A. Walsh" <lkml <at> tlinx.org>
Cc: 11621 <at> debbugs.gnu.org
Subject: bug#11621: questionable locale sorting order (especially as related to char ranges in REs)
Date: Mon, 04 Jun 2012 09:48:52 +0100

On 06/04/2012 06:03 AM, Linda A. Walsh wrote:
> 
> 
> Pádraig Brady wrote:
>> On 06/03/2012 11:13 PM, Linda Walsh wrote:
>>> Within the past few years, use of ranges in REs has become
>>> unreliable due to some locale changes sorting their native character
>>> sets such that a<A<b<B<y<Y<z<Z (vs. 'C' ordering A<B<Y<Z<a<b<y<z).
>>>
>>> There seems to be a problem when a user has set their system to use
>>> Unicode, it is no longer using the locale specific character set (iso-8859-x,
>>> or others).
> ----
>     To clarify my above statement:
> 
> 
>    There seems to be a problem when a user has set their system to use
> Unicode: it is no longer using the locale-specific character set (iso-8859-x,
> or others) -- ***or*** *their* *orderings*.  I.e., Unicode defines a collation
> order -- I don't know that the others do ('C' does, but I don't know about
> other locale-specific character sets).
> 
> 
>> It's not specific to "unicode". Sorting in an iso-8859-1 charset
>> results in locale ordering:
> ----
>     Can you cite a source specifying the sort/collation order of the
> iso-8859-1 charset that would prove that it is non-conforming to the collation
> specification for that charset?
> 
>     I.e., if there is no official source, then the order with that charset
> is "undefined", and while it may not be desirable, returning a<A<b<B would not
> be "an error".

It's a charset. Of course the order is defined. Try: man iso-8859-1

The relative ordering can be trivially inferred from the command I presented.
But to be explicit:

$ printf "%s\n" A b a á | iconv -t iso-8859-1 | LC_ALL=en_US sort | iconv -f iso-8859-1
a
A
á
b

$ printf "%s\n" A b a á | iconv -t iso-8859-1 | LC_ALL=C sort | iconv -f iso-8859-1
A
a
b
á
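
As a rough illustration of why the C result above follows from the charset itself (a sketch added for clarity, not part of the original exchange): in ISO-8859-1 every character is a single byte, and the C locale simply compares those byte values. The octal escape `\341` is the ISO-8859-1 byte for á (0xE1); the variable name `sorted_hex` is only illustrative.

```shell
# ISO-8859-1 byte values: A=0x41, a=0x61, b=0x62, á=0xE1 (octal 341).
# Sort the raw bytes in the C locale and dump them as hex; 0a is the newline.
sorted_hex=$(printf 'A\nb\na\n\341\n' | LC_ALL=C sort | od -An -tx1 | tr -d ' \n')
echo "$sorted_hex"
```

The bytes come out in ascending numeric order (41, 61, 62, e1), which is the A, a, b, á sequence shown above.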

> 
> 
> 
> 
>>> http://unicode.org/charts/case/chart_Latin.htm.
>>
>> http://unicode.org/charts/case/chart_Latin.html
> ---
>     ^^Correct^^ (typo)
> 
>>> Temporarily ignoring accents, only talking about lower and upper
>>> case letters, ...
>>
>> Well case comparison is a complicated area.
> ----
>     A bit, but it's mostly just wrong in the GNU library concerning Unicode and,
> as you are pointing out, the 'C' encoding as well.
> The 'C' locale was the original charset used by the 'C' language -- only 8 bits
> wide.
> 
>     So how can it sort characters beyond the lower 256?
> This would seem to be meaningless and bogus output.

http://www.pixelbeat.org/docs/utf8_programming.html

> Is it?...   When the case comparison ordering is specified in a
> standard, it makes it fairly clear that one is either compliant with the standard
> or not.
> 
>     In this case, the GNU sort/collation lib is not Unicode/UTF-8 compliant.
> 
>     What happens in other charsets may or may not be covered under some
> other standard -- e.g. the 'C'/ascii ordering is specified.  But I don't know
> if others have relevant standards or not.
> 
>>
>> For the special case of discounting accented chars etc.
>> you can use an attribute of the well designed UTF-8.
> ---
>     This is not exactly the point -- the point is that the core sort
> DOESN'T use that ordering.  That's the bug I am reporting.

Well you can't generally exclude accents.

> 
>     In reporting this, I'm trying to keep the argument 'simple' and focus on
> the problem of widely used ranges in the first 256 code-points of
> Unicode.
> 
>     Unicode gives a fairly extensive algorithm for handling accents,
> but I didn't want to complicate the discussion by "going there".  Please
> focus this bug on the lower 128 code points, as full unicode compliance
> with the full collation algorithm that is specified is likely to be a
> larger task.  HOWEVER, fixing the sorting/collation order of the lower
> 128 code points is, comparatively, a small task that conceivably could be
> fixed in the next release.

The lower 128 code points (0-127) are ASCII. If your input data is ASCII, just use LC_ALL=C.
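
To tie this back to the character ranges in the subject line, here is a minimal sketch (an added illustration, with `lower_only` as a made-up variable name) of how a bracket range behaves once the C locale is forced for ASCII input:

```shell
# In the C locale a bracket range such as [a-z] covers exactly the bytes
# 0x61..0x7a, so uppercase letters can never match it.
lower_only=$(printf '%s\n' a A b B z Z | LC_ALL=C grep '^[a-z]$')
echo "$lower_only"
```

Only a, b and z are printed. In other locales the meaning of such a range may depend on the collation order in use, which is the behaviour being discussed in this report.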

>> Enabling traditional byte comparison on (normalized) UTF-8 data
>> will result in data sorted in Unicode code point order:
>> A b a á => A a b á
> 
> But you are missing the point (as well as raising an interesting 'feature'(?bug?)).
> 
> How is it that 'C' collation collates characters that are outside the ascii range?

Well, whether C should be a "unicode" or "ascii" charset is a whole different
kettle of fish. I was just pointing out (as per the link above) that
UTF-8 is well designed, so that it works with many traditional single-byte functions.
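
A small sketch of that property, assuming the script is saved as UTF-8 and the accented letters are precomposed (normalized) characters; the sample characters and the name `by_codepoint` are illustrative choices, not taken from the thread:

```shell
# Code points: A=U+0041, a=U+0061, b=U+0062, á=U+00E1, é=U+00E9, €=U+20AC.
# Their UTF-8 encodings are 41, 61, 62, c3 a1, c3 a9 and e2 82 ac.
# A plain byte comparison (LC_ALL=C) therefore yields code point order.
by_codepoint=$(printf '%s\n' € b é A á a | LC_ALL=C sort)
echo "$by_codepoint"
```

The result is A, a, b, á, é, € (one per line): sort never decodes the input, yet the order matches the code points because UTF-8 lead bytes grow with the code point value.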

> I.e. -- you can't interpret input data as 'unicode' in the 'C' locale.
> So how does this work in the 'C' locale?  AND more importantly -- it SHOULD work
> when charset is unicode (UTF-8)... and does not.  Test prog:
> ---------------
> #!/bin/bash
> set -m
> # vals to test:
> declare -a vals=( A a B b X x Y y Z z Ⅷ  Ⅴ Ⅲ Ⅰ Ⅿ Ⅽ ⅶ  ⅼ ⅲ )
> COLLATE_ORDER=C
> 
> function isatty {
>     local fd=${1:-1} ;
>     0<&$fd tty -s
> }
> 
> function ord {
>   local nl="";
>     isatty && nl="\n"
>     printf "%d$nl" "'$1"
> }
> 
> function background_print {
>     readarray -t inp
>     for ch in "${inp[@]}"; {
>         printf "%s   (U+%x)\n" "$ch" "$(ord "$ch")"
>     }
> }
> 
> 
> printf "%s\n" "${vals[@]}" |
>         LC_COLLATE=$COLLATE_ORDER sort |
>         background_print
> 
> ------------------------------------
> 
> Note, that the above produces:
> 
> /tmp/stest
> Ⅷ   (U+2167)
> Ⅴ   (U+2164)
> Ⅲ   (U+2162)
> Ⅰ   (U+2160)
> Ⅿ   (U+216f)
> Ⅽ   (U+216d)
> ⅶ   (U+2176)
> ⅼ   (U+217c)
> ⅲ   (U+2172)
> a   (U+61)
> A   (U+41)
> b   (U+62)
> B   (U+42)
> x   (U+78)
> X   (U+58)
> y   (U+79)
> Y   (U+59)
> z   (U+7a)
> Z   (U+5a)
> 
> NOT the output you showed...Seems there's a bug in the C collation order?

Note that C doesn't use a collation order; it's a simple byte comparison.
Seems there may be a bug in your script?
Also ensure that LC_ALL is not set, which will override LC_COLLATE.

$ printf "%s\n" A a B b 2 1 Ⅷ  ⅶ ⅲ | LC_COLLATE=C sort
1
2
A
B
a
b
Ⅷ
ⅲ
ⅶ
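
The LC_ALL point can be checked with a short sketch like the following (added for illustration). The locale name `en_US.UTF-8` is an assumption about what is installed; the result is the same whether or not it exists, because LC_ALL wins in the first command and LC_COLLATE=C applies in the second.

```shell
# LC_ALL, when set and non-empty, takes precedence over LC_COLLATE.
# en_US.UTF-8 is an assumed locale name; it is overridden here either way.
input=$(printf '%s\n' b A a B)
with_lc_all=$(printf '%s\n' "$input" | LC_ALL=C LC_COLLATE=en_US.UTF-8 sort)
echo "$with_lc_all"

# An empty LC_ALL is treated as unset, so LC_COLLATE takes effect.
with_lc_collate=$(printf '%s\n' "$input" | LC_ALL= LC_COLLATE=C sort)
echo "$with_lc_collate"
```

Both commands print A, B, a, b. Running `locale` with no arguments is a quick way to see which of these variables are actually in effect in a given shell.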

> 
> Changing collation order to UTF-8:
> 
> Same thing:
>  /tmp/stest
> Ⅷ   (U+2167)
> Ⅴ   (U+2164)
> Ⅲ   (U+2162)
> Ⅰ   (U+2160)
> Ⅿ   (U+216f)
> Ⅽ   (U+216d)
> ⅶ   (U+2176)
> ⅼ   (U+217c)
> ⅲ   (U+2172)
> a   (U+61)
> A   (U+41)
> b   (U+62)
> B   (U+42)
> x   (U+78)
> X   (U+58)
> y   (U+79)
> Y   (U+59)
> z   (U+7a)
> Z   (U+5a)
> 
> 
>>> I would assert this is a serious bug that should be addressed ASAP...
>>
>> As for the question in the subject for handling ranges in REs,
>> there has been recent work in changing as you suggest:
>>
>> http://lists.gnu.org/archive/html/bug-gnulib/2011-06/threads.html#00105
> ----
> 
>     Recent?

?

> The most recent posts on that thread look to be from June of last year.
> I.e. a year ago.
> 
> I'm trying to stay focused on specific problems -- UTF-8 ordering is defined.
> The GNU library doesn't follow it.
> 
> Major problem with so many progs relying on the lib!...

cheers,
Pádraig.




This bug report was last modified 13 years and 66 days ago.
