GNU bug report logs - #11621
questionable locale sorting order (especially as related to char ranges in REs)
Within the past few years, use of ranges in REs has become
unreliable because some locales now sort their native character
sets such that a<A<b<B<y<Y<z<Z (vs. the 'C' ordering A<B<Y<Z<a<b<y<z).
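To make the effect on ranges concrete, here is a minimal sketch in Python. It does not call any real locale; `interleaved_key` is a hypothetical stand-in for a collation that orders a<A<b<B<..., and `chars_in_range` is an illustrative helper that lists which ASCII letters fall between the two endpoints of a range such as [a-z] under a given ordering.

```python
import string


def interleaved_key(ch):
    # Hypothetical stand-in for a locale collation that orders a<A<b<B<...:
    # compare by letter first, then put lower case before upper case.
    return (ch.lower(), ch.isupper())


def chars_in_range(lo, hi, key):
    """Return the ASCII letters that collate between lo and hi under key."""
    return "".join(
        c
        for c in sorted(string.ascii_letters, key=key)
        if key(lo) <= key(c) <= key(hi)
    )


# 'C'-style ordering: compare raw code points.
c_range = chars_in_range("a", "z", ord)
# Interleaved ordering, as in the a<A<b<B<y<Y<z<Z example.
loc_range = chars_in_range("a", "z", interleaved_key)

print(c_range)    # lower case letters only
print(loc_range)  # also picks up A through Y, but not Z
```

Under the interleaved ordering, a range written as a-z covers every upper case letter except Z, which is the kind of surprise described here.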
Additionally, many distros have switched to UTF-8, resulting in
locales like en_GB.UTF-8, en_US.UTF-8, etc.
There seems to be a problem in that when a user has set their system to use
Unicode, it is no longer using the locale-specific character set
(iso-8859-x or others).
In Unicode, it is recommended that upper case be uniformly sorted
below lower case (section 6.6, http://www.unicode.org/reports/tr10/).
A chart, including accent variations, is at
http://unicode.org/charts/case/chart_Latin.htm.
Temporarily ignoring accents and talking only about lower and upper
case letters, you will note that the upper case letters have weights
A=41, B=42, C=43, while the lower case letters, starting from 'a', have
weights a=61, b=62, c=63.
This uniformly puts all lower case letters "after" any upper case letters.
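The numbers quoted above match the hexadecimal code points of those letters. A short Python sketch, with illustrative variable names, prints them and checks that every ASCII upper case letter has a smaller code point than every ASCII lower case letter:

```python
import string

# Code points of the letters mentioned above, printed in hexadecimal.
for ch in "ABCabc":
    print(f"{ch} = {ord(ch):X}")

# Highest upper case code point versus lowest lower case code point.
highest_upper = max(ord(c) for c in string.ascii_uppercase)
lowest_lower = min(ord(c) for c in string.ascii_lowercase)
all_upper_first = highest_upper < lowest_lower
print(all_upper_first)
```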
Thus -- I am asserting that any computer using a locale for country
preferences, BUT also using a Unicode character set (e.g. UTF-8),
should return sorted results as specified by the character set.
I.e. the utility 'sort' (and any programs that use the collation/sorting
order specified in the coreutils libs) should return A-Z < a-z.
This is currently not the case and is leading to erroneous results
in programs written before locales were considered. The thing is --
in many cases, within some short period of locales being implemented,
many or most distros also switched to UTF-8.
Unfortunately its collation order has not been respected.
I would assert this is a serious bug that should be addressed ASAP...
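One way to observe the behaviour on a given system is sketched below in Python, using the standard `locale` module. The word list and the helper `sort_under` are illustrative, and whether `en_US.UTF-8` is installed is an assumption, so the helper returns None when the locale is unavailable. No particular result for the UTF-8 locale is claimed here; it depends on the system's collation tables.

```python
import locale

WORDS = ["b", "A", "a", "B", "Z", "z"]


def sort_under(locale_name, words):
    """Sort words with the collation of locale_name; None if unavailable."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None  # locale not installed on this system
    try:
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore previous setting


c_sorted = sort_under("C", WORDS)
utf8_sorted = sort_under("en_US.UTF-8", WORDS)  # may be None

print(c_sorted)     # code point order: upper case before lower case
print(utf8_sorted)  # system dependent
```

Comparing the two printed lists shows whether the UTF-8 locale on that machine keeps A-Z before a-z or interleaves the cases.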
Thanks,
Linda W.
This bug report was last modified 13 years and 66 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.