GNU bug report logs - #11621
questionable locale sorting order (especially as related to char ranges in REs)


Package: coreutils

Reported by: Linda Walsh <coreutils <at> tlinx.org>

Date: Sun, 3 Jun 2012 22:16:02 UTC

Severity: normal


From: Pádraig Brady <P <at> draigBrady.com>
To: Linda Walsh <coreutils <at> tlinx.org>
Cc: 11621 <at> debbugs.gnu.org
Subject: bug#11621: questionable locale sorting order (especially as related to char ranges in REs)
Date: Sun, 03 Jun 2012 23:57:04 +0100
On 06/03/2012 11:13 PM, Linda Walsh wrote:
> Within the past few years, use of ranges in REs has become
> unreliable due to some locale changes sorting their native character
> sets such that a<A<b<B<y<Y<z<Z (vs. 'C' ordering A<B<Y<Z<a<b<y<z).
> 
> Additionally, many distros have switched to UTF-8, resulting in
> localizations like en_GB.UTF-8, en_US.UTF-8, etc...
> 
> There seems to be a problem in that when a user has set their system to use
> Unicode, it no longer uses the locale-specific character set (iso-8859-x,
> or others).

It's not specific to "unicode". Sorting in an iso-8859-1 charset
also results in locale ordering:

$ printf "%s\n" A b a á | iconv -t iso-8859-1 | LC_ALL=en_US sort | iconv -f iso-8859-1
a
A
á
b

> In Unicode, it is recommended that upper case be uniformly sorted
> below lower case (section 6.6, http://www.unicode.org/reports/tr10/).
> 
> A chart, including accent variations is at
> 
> http://unicode.org/charts/case/chart_Latin.htm.

http://unicode.org/charts/case/chart_Latin.html

> Temporarily ignoring accents, only talking about lower and upper
> case letters, you will note that the sorting order of A=41, B=42, C=43,
> while the lower case letters from 'a', have weights a=61, b=62, c=63.
> 
> This uniformly puts all lower case letters "after" any upper case letters.
> 
> Thus -- I am asserting, that any computer using a locale for country
> preferences, BUT is also using a unicode character set (e.g. UTF-8),
> should return sorted results as specified by the character set.
> 
> I.e. the utility 'sort' (and any programs that use the collation/sorting
> order specified in the core-utils libs) should return A-Z < a-z.

Well, case comparison is a complicated area.

For the special case of discounting accented chars etc.,
you can use a property of the well-designed UTF-8 encoding:
traditional byte comparison on (normalized) UTF-8 data
results in data sorted in Unicode code point order:

$ printf "%s\n" A b a á | LC_ALL=C sort
A
a
b
á
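
To make that reusable in a script, a minimal sketch follows. It assumes
the input is already valid, normalized UTF-8; the helper name
codepoint_sort is made up for illustration and is not part of coreutils.

```shell
# Sketch: sort UTF-8 input bytewise, ignoring the caller's locale.
# For valid UTF-8, byte order and Unicode code point order coincide.
# codepoint_sort is a hypothetical helper name, not a coreutils command.
codepoint_sort() {
  LC_ALL=C sort
}

# Same data as the example above; prints A, a, b, á (one per line).
printf '%s\n' A b a á | codepoint_sort
```

Setting LC_ALL only for the one sort invocation leaves the rest of the
script running in the user's locale.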

> This is currently not the case and is leading to erroneous results
> in programs written before locales were considered.  The thing is --
> in many cases, within some short period of locales being implemented,
> many or most distros also switched to UTF-8.
> 
> Unfortunately its collation order has not been respected.
> 
> I would assert this is a serious bug that should be addressed ASAP...

As for the question in the subject about handling ranges in REs,
there has been recent work on changing that as you suggest:

http://lists.gnu.org/archive/html/bug-gnulib/2011-06/threads.html#00105
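
Until that work reaches your tools, a script that depends on a range
meaning "ASCII upper case only" can pin the locale for that one command.
A minimal sketch, where ascii_upper_only is a made-up helper name:

```shell
# Sketch: force the C locale so that [A-Z] covers only the bytes
# 0x41..0x5A, whatever collation order the user's locale defines.
# ascii_upper_only is a hypothetical helper name.
ascii_upper_only() {
  LC_ALL=C grep '^[A-Z]$'
}

# Prints A, B, Y, Z (one per line); the lower case letters are dropped.
printf '%s\n' a A b B y Y z Z | ascii_upper_only
```

Where "upper case in the current locale" is what is wanted instead, the
POSIX class [[:upper:]] avoids relying on range ordering at all.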

cheers,
Pádraig.



