GNU bug report logs - #11621
questionable locale sorting order (especially as related to char ranges in REs)

Package: coreutils;

Reported by: Linda Walsh <coreutils <at> tlinx.org>

Date: Sun, 3 Jun 2012 22:16:02 UTC

Severity: normal


Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Linda Walsh <coreutils <at> tlinx.org>
To: bug-coreutils <at> gnu.org
Subject: questionable locale sorting order (especially as related to char
	ranges in REs)
Date: Sun, 03 Jun 2012 15:13:19 -0700
Within the past few years, use of ranges in REs has become
unreliable because some locales now sort their native character
sets such that a<A<b<B<y<Y<z<Z (vs. 'C' ordering A<B<Y<Z<a<b<y<z).
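
To make the effect on ranges concrete, here is a minimal Python sketch. It does not call any real locale; the two key functions are hypothetical stand-ins for the two orderings described above, limited to the ASCII letters. It lists which letters fall inside the range a-z under each ordering.

```python
import string

LETTERS = string.ascii_uppercase + string.ascii_lowercase

def c_key(ch):
    # 'C' ordering: compare by code point, so A..Z come before a..z.
    return ord(ch)

def interleaved_key(ch):
    # Hypothetical model of a<A<b<B<...<z<Z: order by letter first,
    # then put the lower case form before the upper case form.
    return (ch.lower(), ch.isupper())

def members(lo, hi, key):
    # Letters that a range lo-hi would match if the range is evaluated
    # with the given ordering, listed in that ordering.
    return "".join(ch for ch in sorted(LETTERS, key=key)
                   if key(lo) <= key(ch) <= key(hi))

c_members = members("a", "z", c_key)
interleaved_members = members("a", "z", interleaved_key)

print(c_members)            # lower case letters only
print(interleaved_members)  # also contains A through Y, but not Z
```

Under the interleaved model the range picks up every upper case letter except Z, which is why a pattern written with 'C' ordering in mind can match unexpected input.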

Additionally, many distros have switched to UTF-8, resulting in
localizations like en_GB.UTF-8, en_US.UTF-8, etc.

There seems to be a problem: when a user has set their system to use
Unicode, it is no longer using the locale-specific character set
(iso-8859-x or others).

In Unicode, it is recommended that upper case be uniformly sorted
below lower case (section 6.6, http://www.unicode.org/reports/tr10/).

A chart, including accent variations is at

http://unicode.org/charts/case/chart_Latin.htm.

Temporarily ignoring accents and talking only about lower and upper
case letters, you will note that the upper case letters sort as A=41,
B=42, C=43, while the lower case letters, from 'a', have weights a=61,
b=62, c=63 (all values hexadecimal).

This uniformly puts all lower case letters "after" any upper case letters.
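
The numbers quoted above match the hexadecimal code points of those letters. A small Python sketch (the variable names are only illustrative) prints them and checks that every ASCII upper case letter has a smaller code point than every ASCII lower case letter:

```python
import string

# Hexadecimal code points of the letters quoted above.
upper = {ch: format(ord(ch), "X") for ch in "ABC"}
lower = {ch: format(ord(ch), "X") for ch in "abc"}
print(upper)  # {'A': '41', 'B': '42', 'C': '43'}
print(lower)  # {'a': '61', 'b': '62', 'c': '63'}

# True when the largest upper case code point (Z) is below the
# smallest lower case code point (a).
all_upper_first = (max(map(ord, string.ascii_uppercase))
                   < min(map(ord, string.ascii_lowercase)))
print(all_upper_first)  # True
```

This only covers the code point values themselves; it says nothing about the weights a particular locale's collation tables assign.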

Thus -- I am asserting that any computer using a locale for country
preferences, BUT also using a Unicode character set (e.g. UTF-8),
should return sorted results as specified by the character set.

I.e. the utility 'sort' (and any programs that use the collation/sorting
order specified in the coreutils libs) should return A-Z < a-z.
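
As a rough illustration of the two behaviours being contrasted, the Python sketch below sorts a small list once by code point and once with a locale's collation. The sample list and the helper name `locale_sorted` are made up for this example, and the locale name en_US.UTF-8 is an assumption: it may not be installed, in which case the helper returns None instead of guessing. No particular locale result is claimed here, since it depends on the system's collation tables.

```python
import locale

words = ["b", "A", "a", "B", "Z", "z"]

# Plain code point order, which is what the 'C' locale gives: A-Z < a-z.
codepoint_sorted = sorted(words)
print(codepoint_sorted)  # ['A', 'B', 'Z', 'a', 'b', 'z']

def locale_sorted(items, name="en_US.UTF-8"):
    """Sort with the named locale's collation; None if it is unavailable."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        return None
    try:
        # strxfrm turns each string into a key that compares
        # according to the active LC_COLLATE rules.
        return sorted(items, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)

print(locale_sorted(words))
```

Comparing the two printed lists on a given system shows whether that system's locale interleaves the cases or keeps A-Z ahead of a-z.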


This is currently not the case, and it is leading to erroneous results
in programs written before locales were considered.  The thing is --
in many cases, within a short period of locales being implemented,
many or most distros also switched to UTF-8.

Unfortunately, its collation order has not been respected.

I would assert this is a serious bug that should be addressed ASAP...


Thanks,
Linda W.






This bug report was last modified 13 years and 66 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.