#11621 - questionable locale sorting order (especially as related to char ranges in REs)

GNU bug report logs - #11621
questionable locale sorting order (especially as related to char ranges in REs)

Reported by: Linda Walsh <coreutils <at> tlinx.org>

Date: Sun, 3 Jun 2012 22:16:02 UTC

Severity: normal


From: Pádraig Brady <P <at> draigBrady.com>
To: Linda Walsh <coreutils <at> tlinx.org>
Cc: 11621 <at> debbugs.gnu.org
Subject: bug#11621: questionable locale sorting order (especially as related to char ranges in REs)
Date: Sun, 03 Jun 2012 23:57:04 +0100

On 06/03/2012 11:13 PM, Linda Walsh wrote:
> Within the past few years, use of ranges in RE's has become
> unreliable due to some locale changes sorting their native character
> sets such that a<A<b<B<y<Y<z<Z (vs. 'C' ordering A<B<Y<Z<a<b<y<z).
>
> Additionally many distros have switched to UTF-8 resulting in
> localizations like en_GB.UTF-8, en_US.UTF-8, etc...
>
> There seems to be a problem in that when a user has set their system to use
> Unicode, it is no longer using the locale specific character set (iso-8859-x,
> or others).

It's not specific to "unicode".
Sorting in an iso-8859-1 charset results in locale ordering:

  $ printf "%s\n" A b a á | iconv -t iso-8859-1 | LC_ALL=en_US sort | iconv -f iso-8859-1
  a
  A
  á
  b

> In Unicode, it is recommended that upper case be uniformly sorted
> below lower case (section 6.6, http://www.unicode.org/reports/tr10/).
>
> A chart, including accent variations is at
>
> http://unicode.org/charts/case/chart_Latin.htm.

http://unicode.org/charts/case/chart_Latin.html

> Temporarily ignoring accents, only talking about lower and upper
> case letters, you will note that the sorting order of A=41, B=42, C=43,
> while the lower case letters from 'a', have weights a=61, b=62, c=63.
>
> This uniformly puts all lower case letters "after" any upper case letters.
>
> Thus -- I am asserting, that any computer using a locale for country
> preferences, BUT is also using a unicode character set (e.g. UTF-8),
> should return sorted results as specified by the character set.
>
> I.e. the utility 'sort' (and any programs that use the collation/sorting
> order specified in the core-utils libs) should return A-Z < a-z.

Well, case comparison is a complicated area.
For the special case of discounting accented chars etc.
you can use an attribute of the well designed UTF-8.
Enabling traditional byte comparison on (normalized) UTF-8 data
will result in data sorted in Unicode code point order:

  $ printf "%s\n" A b a á | LC_ALL=C sort
  A
  a
  b
  á

> This is currently not the case and is leading to erroneous results
> in programs written before locales were considered. The thing is --
> in many cases, within some short period of locales being implemented,
> many or most distros also switched to UTF-8.
>
> Unfortunately its collation order has not been respected.
>
> I would assert this is a serious bug that should be addressed ASAP...

As for the question in the subject for handling ranges in REs,
there has been recent work in changing as you suggest:
http://lists.gnu.org/archive/html/bug-gnulib/2011-06/threads.html#00105

cheers,
Pádraig.
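The byte-comparison approach in the reply can also be applied to a single command rather than the whole session. The following is a minimal sketch, not something from the thread: the helper names lower_only and bytewise_sort are hypothetical, and it assumes a POSIX shell with standard grep and sort. Setting LC_ALL=C for one command makes both the range [a-z] and the sort order follow byte values, so the range covers only the ASCII lowercase letters.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): keep only input lines that are
# a single ASCII lowercase letter. LC_ALL=C applies to this grep only, so
# the range [a-z] is evaluated by byte value, not by locale collation.
lower_only() {
    LC_ALL=C grep '^[a-z]$'
}

# Hypothetical helper: sort stdin by raw byte value. For valid UTF-8 input
# this matches Unicode code point order, as described in the reply.
bytewise_sort() {
    LC_ALL=C sort
}

# Upper case letters are excluded from the range in the C locale.
printf '%s\n' A b a Z | lower_only

# All upper case ASCII letters sort before all lower case ones.
printf '%s\n' A b a Z | bytewise_sort
```

Scoping LC_ALL=C to the individual command leaves the user's locale in effect for everything else, such as messages and date formats. How a range like [a-z] behaves in other locales depends on the tool and its version, so this sketch makes no claim about that case.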

This bug report was last modified 13 years and 66 days ago.

