GNU bug report logs - #9740: Bug in sort
Reported by: Lluís Padró <padro <at> lsi.upc.edu>
Date: Wed, 12 Oct 2011 18:49:02 UTC
Severity: normal
Tags: notabug
Done: Eric Blake <eblake <at> redhat.com>
Bug is archived. No further changes may be made.
Message #13 received at 9740-done <at> debbugs.gnu.org:
Great, thanks!
On 12/10/11 21:02, Eric Blake wrote:
> tag 9740 notabug
> thanks
>
> On 10/12/2011 12:41 PM, Lluís Padró wrote:
>>
>> I found a bug in the "sort" utility that happens under utf8 locales, though
>> no character beyond basic ascii is involved in it...
>
> Thanks for the report; however, this is almost certainly a case of your locale defining a different
> collation order than what you were expecting. See the FAQ:
> https://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021
>
>>
>> I'm using "sort (GNU coreutils) 7.4" from package
>> "coreutils-7.4-2ubuntu3" on ubuntu lucid 10.04.03 LTS
>
> The latest version of coreutils, 8.14, includes a --debug option that makes it even more apparent
> why sort is behaving correctly:
>
>> ## Let's try another locale
>> ~$ export LC_ALL="en_US.UTF-8"
>
>> ## Sort fails. Shorter words are sorted after longer words with the same
>> prefix.
>> ~$ sort testfile
>> abcd Z
>> abce Z
>> abc Z
>> ab Z
>
> $ printf 'abc Z\nab Z\nabcd Z\nabce Z\n' | sort --debug
> sort: using `en_US.UTF-8' sorting rules
> abcd Z
> ______
> abce Z
> ______
> abc Z
> _____
> ab Z
> ____
>
> So, what exactly is sort comparing? The entire line (because you didn't specify any -k options to
> limit it to fields). And how does it do the comparison? By strcoll("abcd Z", "abc Z"). And how does
> strcoll() behave in the en_US.UTF-8 locale? By dictionary collation - that is, case and punctuation
> (including space) are ignored on the first comparison pass. So you get the same ordering for
> strcoll("abcd Z", "abc Z") as for strcoll("abcdz", "abcz") in that locale, and sure enough, d comes
> before z, so the sort is correct.
>
> You already figured out that LC_ALL=C forces sorting to honor byte values. But if you insist on
> using en_US collation, then maybe you should also look at forcing the sort to honor specific fields:
>
> $ printf 'abc Z\nab Z\nabcd Z\nabce Z\n' | sort --debug -sb -k1,1 -k2,2
> sort: using `en_US.UTF-8' sorting rules
> ab Z
> __
>    _
> abc Z
> ___
>     _
> abcd Z
> ____
>      _
> abce Z
> ____
>      _
>
>
--
---------------------------------------------------
Lluís Padró
Departament de Llenguatges i Sistemes Informàtics
Centre de Recerca TALP
UNIVERSITAT POLITÈCNICA DE CATALUNYA
http://www.lsi.upc.edu/~padro
---------------------------------------------------
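
The byte-value ordering that the reply attributes to LC_ALL=C can be checked with the same four input lines. This is only a minimal sketch, assuming a POSIX shell and a sort that honors LC_ALL; the variable name sorted_c is chosen here for illustration and does not appear in the thread.

```shell
# Same four lines as in the report, sorted by raw byte value.
# LC_ALL=C is set only for the sort process, so the session locale is untouched.
sorted_c=$(printf 'abc Z\nab Z\nabcd Z\nabce Z\n' | LC_ALL=C sort)

# In the C locale a space (0x20) sorts before any letter, so shorter
# words come before longer words that share the same prefix.
printf '%s\n' "$sorted_c"
# ab Z
# abc Z
# abcd Z
# abce Z
```

Setting the variable on the command line rather than exporting it keeps the change local to one invocation, which may be preferable when the rest of the session still needs en_US.UTF-8.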
This bug report was last modified 13 years and 276 days ago.