GNU bug report logs - #11621
questionable locale sorting order (especially as related to char ranges in REs)
Message #8 received at 11621 <at> debbugs.gnu.org:
On 06/03/2012 11:13 PM, Linda Walsh wrote:
> Within the past few years, use of ranges in REs has become
> unreliable due to some locale changes sorting their native character
> sets such that a<A<b<B<y<Y<z<Z (vs. 'C' ordering A<B<Y<Z<a<b<y<z).
>
> Additionally, many distros have switched to UTF-8, resulting in
> localizations like en_GB.UTF-8, en_US.UTF-8, etc...
>
> There seems to be a problem: when a user has set their system to use
> Unicode, it no longer uses the locale-specific character set (iso-8859-x
> or others).
It's not specific to Unicode. Sorting in an iso-8859-1 charset
also results in locale ordering:
$ printf "%s\n" A b a á | iconv -t iso-8859-1 | LC_ALL=en_US sort | iconv -f iso-8859-1
a
A
á
b
> In Unicode, it is recommended that upper case be uniformly sorted
> below lower case (section 6.6, http://www.unicode.org/reports/tr10/).
>
> A chart, including accent variations is at
>
> http://unicode.org/charts/case/chart_Latin.htm.
http://unicode.org/charts/case/chart_Latin.html
> Temporarily ignoring accents, only talking about lower and upper
> case letters, you will note that the sorting order of A=41, B=42, C=43,
> while the lower case letters from 'a', have weights a=61, b=62, c=63.
>
> This uniformly puts all lower case letters "after" any upper case letters.
>
> Thus I am asserting that any computer that uses a locale for country
> preferences, but also uses a Unicode character set (e.g. UTF-8),
> should return sorted results as specified by the character set.
>
> I.e. the utility 'sort' (and any programs that use the collation/sorting
> order specified in the core-utils libs) should return A-Z < a-z.
Well, case comparison is a complicated area.
For the special case of discounting accented chars etc.,
you can use a property of the well-designed UTF-8 encoding:
enabling traditional byte comparison on (normalized) UTF-8 data
will result in data sorted in Unicode code point order:
$ printf "%s\n" A b a á | LC_ALL=C sort
A
a
b
á
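As a rough sketch of why the byte comparison above gives code point order, the snippet below prints the UTF-8 bytes of each sample character and then repeats the C-locale sort. The helper name utf8_bytes is hypothetical (it is not part of coreutils), and the sketch assumes the script and terminal are UTF-8 encoded.

```shell
# Hypothetical helper (not from coreutils): print the UTF-8 bytes of its
# argument as lowercase hex. Assumes this script is saved as UTF-8.
utf8_bytes() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

# ASCII letters encode to a single byte; "á" (U+00E1) encodes to two
# bytes whose lead byte is larger than any ASCII byte.
for c in A a b á; do
  printf '%s %s\n' "$c" "$(utf8_bytes "$c")"
done
# A 41
# a 61
# b 62
# á c3a1

# Byte-wise comparison in the C locale therefore orders A, a, b, á.
printf '%s\n' A b a á | LC_ALL=C sort
```

Since 0x41 < 0x61 < 0x62 < 0xc3, comparing the raw bytes yields the same order as comparing the code points U+0041, U+0061, U+0062, U+00E1.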
> This is currently not the case and is leading to erroneous results
> in programs written before locales were considered. The thing is --
> in many cases, within some short period of locales being implemented,
> many or most distros also switched to UTF-8.
>
> Unfortunately its collation order has not been respected.
>
> I would assert this is a serious bug that should be addressed ASAP...
As for the question in the subject for handling ranges in REs,
there has been recent work in changing as you suggest:
http://lists.gnu.org/archive/html/bug-gnulib/2011-06/threads.html#00105
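Until then, a minimal sketch of getting predictable ranges today is to force the C locale for the match, or to use a POSIX character class instead of a range. The sample input is made up for illustration, and only C-locale behaviour is shown, since results in other locales depend on the libc and grep versions in use.

```shell
# In the C locale a range is defined by byte values, so [a-z] matches
# exactly the 26 lowercase ASCII letters.
printf '%s\n' a B y Z | LC_ALL=C grep '^[a-z]$'
# a
# y

# A character class is defined by character type rather than collation
# order, so it does not depend on how the locale sorts letters.
printf '%s\n' a B y Z | LC_ALL=C grep '^[[:lower:]]$'
# a
# y
```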
cheers,
Pádraig.
This bug report was last modified 13 years and 66 days ago.