#9740 - Bug in sort - GNU bug report logs

GNU bug report logs - #9740
Bug in sort

Reported by: Lluís Padró <padro <at> lsi.upc.edu>

Date: Wed, 12 Oct 2011 18:49:02 UTC

Severity: normal

Tags: notabug

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.

Message #13 received at 9740-done <at> debbugs.gnu.org (full text, mbox):

From: Lluís Padró <padro <at> lsi.upc.edu>
To: Eric Blake <eblake <at> redhat.com>
Cc: 9740-done <at> debbugs.gnu.org
Subject: Re: bug#9740: Bug in sort
Date: Thu, 13 Oct 2011 09:29:00 +0200

Great, thanks!

On 12/10/11 21:02, Eric Blake wrote:
> tag 9740 notabug
> thanks
>
> On 10/12/2011 12:41 PM, Lluís Padró wrote:
>>
>> I found a bug in the "sort" utility that happens under utf8 locales, though
>> no character beyond basic ascii is involved in it...
>
> Thanks for the report; however, this is almost certainly a case of your locale defining a different
> collation order than what you were expecting. See the FAQ:
> https://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021
>
>>
>> I'm using "sort (GNU coreutils) 7.4" from package
>> "coreutils-7.4-2ubuntu3" on ubuntu lucid 10.04.03 LTS
>
> The latest version of coreutils, 8.14, includes a --debug option that makes it even more apparent
> why sort is behaving correctly:
>
>> ## Let's try another locale
>> ~$ export LC_ALL="en_US.UTF-8"
>
>> ## Sort fails. Shorter words are sorted after longer words with the same
>> prefix.
>> ~$ sort testfile
>> abcd Z
>> abce Z
>> abc Z
>> ab Z
>
> $ printf 'abc Z\nab Z\nabcd Z\nabce Z\n' | sort --debug
> sort: using `en_US.UTF-8' sorting rules
> abcd Z
> ______
> abce Z
> ______
> abc Z
> _____
> ab Z
> ____
>
> So, what exactly is sort comparing? The entire line (because you didn't specify any -k options to
> limit it to fields). And how does it do the comparison? By strcoll("abcd Z", "abc Z"). And how does
> strcoll() behave in the en_US.UTF-8 locale? By dictionary collation - that is, case and punctuation
> (including space) are ignored. So you get the same answer for both strcoll("abcd Z", "abc Z") and
> for strcoll("abcdz", "abcz") in that locale, and sure enough, d comes before z, so the sort is correct.
>
> You already figured out that LC_ALL=C forces sorting to honor byte values. But if you insist on
> using en_US collation, then maybe you should also look at forcing the sort to honor specific fields:
>
> $ printf 'abc Z\nab Z\nabcd Z\nabce Z\n' | sort --debug -sb -k1,1 -k2,2
> sort: using `en_US.UTF-8' sorting rules
> ab Z
> __
>    _
> abc Z
> ___
>     _
> abcd Z
> ____
>      _
> abce Z
> ____
>      _
>

--
---------------------------------------------------
Lluís Padró
Departament de Llenguatges i Sistemes Informàtics
Centre de Recerca TALP
UNIVERSITAT POLITÈCNICA DE CATALUNYA
http://www.lsi.upc.edu/~padro
---------------------------------------------------
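The explanation in the quoted reply can be reproduced without an en_US.UTF-8 locale installed. The sketch below is only a rough approximation: it assumes that deleting spaces and lowercasing the text mimics what the locale's collation ignores on its first pass, and then sorts the resulting keys byte-wise. It is not how strcoll() is actually implemented, and the variable names (input, bytewise, folded) are illustrative.

```shell
#!/bin/sh
# Same four lines as in the report.
input='abc Z
ab Z
abcd Z
abce Z'

# Byte-wise comparison: space (0x20) sorts before any letter,
# so shorter words come first.
bytewise=$(printf '%s\n' "$input" | LC_ALL=C sort)

# Assumption: approximate the locale's behavior by dropping spaces and
# folding case, then comparing the remaining bytes ("abcdz" vs "abcz").
folded=$(printf '%s\n' "$input" | LC_ALL=C tr -d ' ' | LC_ALL=C tr 'A-Z' 'a-z' | LC_ALL=C sort)

printf 'byte order:\n%s\n\nfolded keys:\n%s\n' "$bytewise" "$folded"
```

The folded keys come out as abcdz, abcez, abcz, abz, which matches the order of the en_US.UTF-8 output in the report, while the byte-wise run gives the order the reporter expected.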

This bug report was last modified 13 years and 276 days ago.

