GNU bug report logs - #9740
Bug in sort

Reported by: Lluís Padró <padro <at> lsi.upc.edu>

Date: Wed, 12 Oct 2011 18:49:02 UTC

Severity: normal

Tags: notabug

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 9740 in the body.
You can then email your comments to 9740 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-coreutils <at> gnu.org:
bug#9740; Package coreutils. (Wed, 12 Oct 2011 18:49:02 GMT)

Acknowledgement sent to Lluís Padró <padro <at> lsi.upc.edu>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 12 Oct 2011 18:49:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Lluís Padró <padro <at> lsi.upc.edu>
To: bug-coreutils <at> gnu.org
Subject: Bug in sort
Date: Wed, 12 Oct 2011 20:41:46 +0200

I found a bug in the "sort" utility that happens under utf8 locales, though
no character beyond basic ascii is involved in it...

I'm using "sort (GNU coreutils) 7.4" from package
 "coreutils-7.4-2ubuntu3" on ubuntu lucid 10.04.03 LTS

Short reproduction of the error follows below.

  thank you

     Lluis

------------------------------------------------
## test file for "sort"
~$ cat testfile
abc Z
ab Z
abcd Z
abce Z

## let's set C locale
~$ export LC_ALL="C"
~$ locale
LANG=en_US.UTF-8
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=C

## sort works as expected
~$ sort testfile
ab Z
abc Z
abcd Z
abce Z

##  Let's try another locale
~$ export LC_ALL="en_US.UTF-8"
~$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

##  Sort fails. Shorter words are sorted after longer words with the same prefix.
~$ sort testfile
abcd Z
abce Z
abc Z
ab Z
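
In the transcript above, `locale` still reports LANG=en_US.UTF-8 after LC_ALL is set to C, yet sort uses byte order. That follows from the POSIX precedence among the locale variables: LC_ALL overrides LC_COLLATE, which overrides LANG. A minimal sketch of that lookup, assuming a POSIX shell; the function name `effective_collate` is hypothetical and is not part of coreutils:

```shell
# Report which locale name governs collation for the current environment.
# Precedence: LC_ALL, then LC_COLLATE, then LANG.
# Assumption: when none is set, "POSIX" is reported as the default name.
effective_collate() {
    if [ -n "${LC_ALL:-}" ]; then
        echo "$LC_ALL"
    elif [ -n "${LC_COLLATE:-}" ]; then
        echo "$LC_COLLATE"
    elif [ -n "${LANG:-}" ]; then
        echo "$LANG"
    else
        echo "POSIX"
    fi
}

echo "collation locale in effect: $(effective_collate)"
```

Checking this before filing a sort-order report can show whether the ordering comes from the locale rather than from sort itself.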

Added tag(s) notabug. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Wed, 12 Oct 2011 19:04:02 GMT)

Reply sent to Eric Blake <eblake <at> redhat.com>:
You have taken responsibility. (Wed, 12 Oct 2011 19:04:02 GMT)

Notification sent to Lluís Padró <padro <at> lsi.upc.edu>:
bug acknowledged by developer. (Wed, 12 Oct 2011 19:04:03 GMT)

Message #12 received at 9740-done <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: Lluís Padró <padro <at> lsi.upc.edu>
Cc: 9740-done <at> debbugs.gnu.org
Subject: Re: bug#9740: Bug in sort
Date: Wed, 12 Oct 2011 13:02:30 -0600

tag 9740 notabug
thanks

On 10/12/2011 12:41 PM, Lluís Padró wrote:
>
> I found a bug in the "sort" utility that happens under utf8 locales, though
> no character beyond basic ascii is involved in it...

Thanks for the report; however, this is almost certainly a case of your 
locale defining a different collation order than what you were 
expecting.  See the FAQ:
https://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021

>
> I'm using "sort (GNU coreutils) 7.4" from package
> "coreutils-7.4-2ubuntu3" on ubuntu lucid 10.04.03 LTS

The latest version of coreutils, 8.14, includes a --debug option that 
makes it even more apparent why sort is behaving correctly:

> ## Let's try another locale
> ~$ export LC_ALL="en_US.UTF-8"

> ## Sort fails. Shorter words are sorted after longer words with the same
> prefix.
> ~$ sort testfile
> abcd Z
> abce Z
> abc Z
> ab Z

$ printf 'abc Z\nab Z\nabcd Z\nabce Z\n' | sort --debug
sort: using `en_US.UTF-8' sorting rules
abcd Z
______
abce Z
______
abc Z
_____
ab Z
____

So, what exactly is sort comparing?  The entire line (because you didn't 
specify any -k options to limit it to fields).  And how does it do the 
comparison?  By strcoll("abcd Z", "abc Z").  And how does strcoll() 
behave in the en_US.UTF-8 locale?  By dictionary collation - that is, 
case and punctuation (including space) are ignored.  So you get the same 
answer for both strcoll("abcd Z", "abc Z") and for strcoll("abcdz", 
"abcz") in that locale, and sure enough, d comes before z, so the sort 
is correct.
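
The whole-line comparison described above can be tried side by side in both locales. This is a minimal sketch, assuming a POSIX shell with `sort` and `locale` on PATH; the helper name `sort_in_locale` is made up for illustration, and the en_US.UTF-8 branch runs only if `locale -a` lists that locale. LC_ALL is set for the single sort invocation rather than exported, so the rest of the session is unaffected.

```shell
# Sort the four lines from the report under a given locale.
# $1 is a locale name; LC_ALL applies only to this one sort command.
sort_in_locale() {
    printf 'abc Z\nab Z\nabcd Z\nabce Z\n' | LC_ALL="$1" sort
}

# Byte-wise comparison: space (0x20) sorts before every letter.
c_order=$(sort_in_locale C)
printf 'C locale:\n%s\n' "$c_order"

# Locale-aware comparison, only if the locale is actually installed.
if locale -a 2>/dev/null | grep -Eqix 'en_US\.utf-?8'; then
    printf 'en_US.UTF-8 locale:\n%s\n' "$(sort_in_locale en_US.UTF-8)"
else
    echo 'en_US.UTF-8 not installed; skipping locale comparison'
fi
```

The C-locale half reproduces the "expected" order from the original report; the second half depends on the collation tables shipped with the system, so its output is not stated here.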

You already figured out that LC_ALL=C forces sorting to honor byte 
values.  But if you insist on using en_US collation, then maybe you 
should also look at forcing the sort to honor specific fields:

$ printf 'abc Z\nab Z\nabcd Z\nabce Z\n' | sort --debug -sb -k1,1 -k2,2
sort: using `en_US.UTF-8' sorting rules
ab Z
__
   _
abc Z
___
    _
abcd Z
____
     _
abce Z
____
     _
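
The field-limited invocation can also be wrapped for reuse. The sketch below uses the same `-s -b -k1,1 -k2,2` options shown above on slightly different, made-up input, so that the second key visibly breaks a tie on the first; the function name `sort_by_fields` is hypothetical.

```shell
# Sort stdin by the first blank-separated field, then by the second.
# -s keeps the sort stable, -b ignores leading blanks in each key,
# -k1,1 and -k2,2 restrict each comparison to a single field.
sort_by_fields() {
    sort -s -b -k1,1 -k2,2
}

# Illustrative input (not from the report): two lines share the key "abc".
fields_order=$(printf 'abc Z\nabc A\nab Z\n' | sort_by_fields)
printf '%s\n' "$fields_order"
```

Because each key is compared on its own, the space between fields never takes part in the comparison, which is why "ab" sorts ahead of "abc" here whether byte or dictionary collation is in effect.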

-- 
Eric Blake   eblake <at> redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org

Message #13 received at 9740-done <at> debbugs.gnu.org:

From: Lluís Padró <padro <at> lsi.upc.edu>
To: Eric Blake <eblake <at> redhat.com>
Cc: 9740-done <at> debbugs.gnu.org
Subject: Re: bug#9740: Bug in sort
Date: Thu, 13 Oct 2011 09:29:00 +0200

  Great, thanks!

-- 
---------------------------------------------------
 Lluís Padró
 Departament de Llenguatges i Sistemes Informàtics
 Centre de Recerca TALP
 UNIVERSITAT POLITÈCNICA DE CATALUNYA
 http://www.lsi.upc.edu/~padro
---------------------------------------------------

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 10 Nov 2011 12:24:04 GMT)

This bug report was last modified 13 years and 276 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.