GNU bug report logs - #9740: Bug in sort
Reported by: Lluís Padró <padro <at> lsi.upc.edu>
Date: Wed, 12 Oct 2011 18:49:02 UTC
Severity: normal
Tags: notabug
Done: Eric Blake <eblake <at> redhat.com>
Bug is archived. No further changes may be made.
[Message part 1 (text/plain, inline)]
Your message dated Wed, 12 Oct 2011 13:02:30 -0600
with message-id <4E95E446.9000402 <at> redhat.com>
and subject line Re: bug#9740: Bug in sort
has caused the debbugs.gnu.org bug report #9740,
regarding Bug in sort
to be marked as done.
(If you believe you have received this mail in error, please contact
help-debbugs <at> gnu.org.)
--
9740: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=9740
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
I found a bug in the "sort" utility that happens under utf8 locales, though
no character beyond basic ascii is involved in it...
I'm using "sort (GNU coreutils) 7.4" from package
"coreutils-7.4-2ubuntu3" on ubuntu lucid 10.04.03 LTS
Short reproduction of the error follows below.
thank you
Lluis
------------------------------------------------
## test file for "sort"
~$ cat testfile
abc Z
ab Z
abcd Z
abce Z
## let's set C locale
~$ export LC_ALL="C"
~$ locale
LANG=en_US.UTF-8
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=C
## sort works as expected
~$ sort testfile
ab Z
abc Z
abcd Z
abce Z
## Let's try another locale
~$ export LC_ALL="en_US.UTF-8"
~$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
## Sort fails. Shorter words are sorted after longer words with the same prefix.
~$ sort testfile
abcd Z
abce Z
abc Z
ab Z
[Message part 3 (message/rfc822, inline)]
tag 9740 notabug
thanks
On 10/12/2011 12:41 PM, Lluís Padró wrote:
>
> I found a bug in the "sort" utility that happens under utf8 locales, though
> no character beyond basic ascii is involved in it...
Thanks for the report; however, this is almost certainly a case of your
locale defining a different collation order than what you were
expecting. See the FAQ:
https://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021
>
> I'm using "sort (GNU coreutils) 7.4" from package
> "coreutils-7.4-2ubuntu3" on ubuntu lucid 10.04.03 LTS
The latest version of coreutils, 8.14, includes a --debug option that
makes it even more apparent why sort is behaving correctly:
> ## Let's try another locale
> ~$ export LC_ALL="en_US.UTF-8"
> ## Sort fails. Shorter words are sorted after longer words with the same prefix.
> ~$ sort testfile
> abcd Z
> abce Z
> abc Z
> ab Z
$ printf 'abc Z\nab Z\nabcd Z\nabce Z\n' | sort --debug
sort: using `en_US.UTF-8' sorting rules
abcd Z
______
abce Z
______
abc Z
_____
ab Z
____
So, what exactly is sort comparing? The entire line (because you didn't
specify any -k options to limit it to fields). And how does it do the
comparison? By strcoll("abcd Z", "abc Z"). And how does strcoll()
behave in the en_US.UTF-8 locale? By dictionary collation: case and
punctuation (including space) are ignored on the first pass and only
break ties between otherwise equal strings. So strcoll("abcd Z", "abc Z")
gives the same answer as strcoll("abcdz", "abcz") in that locale, and
sure enough, d comes before z, so the sort is correct.
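The ordering can be approximated without the en_US.UTF-8 locale installed. The sketch below builds a comparison key by dropping spaces and folding case, then sorts on that key by byte value. The key construction is an assumption made for illustration only; it is not how strcoll() is implemented, and real locale collation has more levels than this.

```shell
#!/bin/sh
# Illustration only: approximate the first collation pass with a key that
# has spaces removed and letters folded to lower case (an assumption, not
# the actual strcoll() algorithm), then sort on that key by byte value.
tab=$(printf '\t')
result=$(
  printf 'abc Z\nab Z\nabcd Z\nabce Z\n' |
    awk '{ key = tolower($0); gsub(/ /, "", key); print key "\t" $0 }' |
    LC_ALL=C sort -t "$tab" -k1,1 |
    cut -f2
)
# Print the original lines in key order.
printf '%s\n' "$result"
```

The keys are abcz, abz, abcdz and abcez, so the lines come out in the same order that the report shows for en_US.UTF-8.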
You already figured out that LC_ALL=C forces sorting to honor byte
values. But if you insist on using en_US collation, then maybe you
should also look at forcing the sort to honor specific fields:
$ printf 'abc Z\nab Z\nabcd Z\nabce Z\n' | sort --debug -sb -k1,1 -k2,2
sort: using `en_US.UTF-8' sorting rules
ab Z
__
_
abc Z
___
_
abcd Z
____
_
abce Z
____
_
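Both suggestions can be tried per command, without exporting LC_ALL for the whole session. This is a minimal sketch; it uses the C locale for the field-based variant as well so that the result does not depend on which locales are installed, whereas the example above ran under en_US.UTF-8. The variable names are made up for the illustration.

```shell
#!/bin/sh
# Sample data from the report (variable names are hypothetical).
input=$(printf 'abc Z\nab Z\nabcd Z\nabce Z')

# Byte-value ordering for this one command only; the shell's own
# locale settings are left untouched.
bytewise=$(printf '%s\n' "$input" | LC_ALL=C sort)

# Field-by-field comparison: first field, then second field, ignoring
# leading blanks (-b) and disabling the last-resort whole-line
# comparison (-s).
by_field=$(printf '%s\n' "$input" | LC_ALL=C sort -s -b -k1,1 -k2,2)

printf '%s\n' "$bytewise"
printf '%s\n' "$by_field"
```

With this input both variants put the shorter words first, because the first field is compared on its own before the "Z" is considered.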
--
Eric Blake eblake <at> redhat.com +1-801-349-2682
Libvirt virtualization library http://libvirt.org
This bug report was last modified 13 years and 228 days ago.