From debbugs-submit-bounces@debbugs.gnu.org Wed Oct 12 14:48:37 2011
Message-ID: <4E95DF6A.5000703@lsi.upc.edu>
Date: Wed, 12 Oct 2011 20:41:46 +0200
From: Lluís Padró
To: bug-coreutils@gnu.org
Subject: Bug in sort

I found a bug in the "sort" utility that happens under utf8 locales, though
no character beyond basic ascii is involved in it...

I'm using "sort (GNU coreutils) 7.4" from package
"coreutils-7.4-2ubuntu3" on ubuntu lucid 10.04.03 LTS

Short reproduction of the error follows below.
thank you

Lluis

------------------------------------------------

## test file for "sort"
~$ cat testfile
abc Z
ab Z
abcd Z
abce Z

## let's set C locale
~$ export LC_ALL="C"
~$ locale
LANG=en_US.UTF-8
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=C

## sort works as expected
~$ sort testfile
ab Z
abc Z
abcd Z
abce Z

## Let's try another locale
~$ export LC_ALL="en_US.UTF-8"
~$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

## Sort fails. Shorter words are sorted after longer words with the same prefix.
~$ sort testfile
abcd Z
abce Z
abc Z
ab Z

From debbugs-submit-bounces@debbugs.gnu.org Wed Oct 12 15:03:11 2011
Message-ID: <4E95E446.9000402@redhat.com>
Date: Wed, 12 Oct 2011 13:02:30 -0600
From: Eric Blake
Organization: Red Hat
To: Lluís Padró
Cc: 9740-done@debbugs.gnu.org
Subject: Re: bug#9740: Bug in sort
References: <4E95DF6A.5000703@lsi.upc.edu>
In-Reply-To: <4E95DF6A.5000703@lsi.upc.edu>

tag 9740 notabug
thanks

On 10/12/2011 12:41 PM, Lluís Padró wrote:
>
> I found a bug in the "sort" utility that happens under utf8 locales, though
> no character beyond basic ascii is involved in it...

Thanks for the report; however, this is almost certainly a case of your
locale defining a different collation order than what you were
expecting. See the FAQ:

https://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021

>
> I'm using "sort (GNU coreutils) 7.4" from package
> "coreutils-7.4-2ubuntu3" on ubuntu lucid 10.04.03 LTS

The latest version of coreutils, 8.14, includes a --debug option that
makes it even more apparent why sort is behaving correctly:

> ## Let's try another locale
> ~$ export LC_ALL="en_US.UTF-8"

> ## Sort fails. Shorter words are sorted after longer words with the same
> prefix.
> ~$ sort testfile
> abcd Z
> abce Z
> abc Z
> ab Z

$ printf 'abc Z\nab Z\nabcd Z\nabce Z\n' | sort --debug
sort: using `en_US.UTF-8' sorting rules
abcd Z
______
abce Z
______
abc Z
_____
ab Z
____

So, what exactly is sort comparing? The entire line (because you didn't
specify any -k options to limit it to fields). And how does it do the
comparison? By strcoll("abcd Z", "abc Z"). And how does strcoll()
behave in the en_US.UTF-8 locale? By dictionary collation - that is,
case and punctuation (including space) are ignored. So you get the same
answer for both strcoll("abcd Z", "abc Z") and for strcoll("abcdz",
"abcz") in that locale, and sure enough, d comes before z, so the sort
is correct.

You already figured out that LC_ALL=C forces sorting to honor byte
values. But if you insist on using en_US collation, then maybe you
should also look at forcing the sort to honor specific fields:

$ printf 'abc Z\nab Z\nabcd Z\nabce Z\n' | sort --debug -sb -k1,1 -k2,2
sort: using `en_US.UTF-8' sorting rules
ab Z
__
   _
abc Z
___
    _
abcd Z
____
     _
abce Z
____
     _

-- 
Eric Blake eblake@redhat.com +1-801-349-2682
Libvirt virtualization library http://libvirt.org

From debbugs-submit-bounces@debbugs.gnu.org Thu Oct 13 03:29:36 2011
Message-ID: <4E96933C.5080001@lsi.upc.edu>
Date: Thu, 13 Oct 2011 09:29:00 +0200
From: Lluís Padró
To: Eric Blake
Cc: 9740-done@debbugs.gnu.org
Subject: Re: bug#9740: Bug in sort
References: <4E95DF6A.5000703@lsi.upc.edu> <4E95E446.9000402@redhat.com>
In-Reply-To: <4E95E446.9000402@redhat.com>

Great, thanks!

---------------------------------------------------
Lluís Padró
Departament de Llenguatges i Sistemes Informàtics
Centre de Recerca TALP
UNIVERSITAT POLITÈCNICA DE CATALUNYA
http://www.lsi.upc.edu/~padro
---------------------------------------------------

From unknown Mon Jun 23 00:36:19 2025
To: internal_control@debbugs.gnu.org
From: Debbugs Internal Request
Subject: Internal Control
Message-Id: bug archived.
Date: Thu, 10 Nov 2011 12:24:04 +0000
User-Agent: Fakemail v42.6.9

# This is a fake control message.
#
# The action:
# bug archived.
thanks
# This fakemail brought to you by your local debbugs
# administrator
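
The behaviour discussed in this thread can be reproduced with a short shell sketch. It assumes a POSIX-style shell, a sort that accepts -s, -b and -k, and the mktemp utility; the variable names are illustrative and do not come from the thread. Whether the en_US.UTF-8 whole-line run shows the reported order depends on that locale being installed and on the C library's collation tables, so no output is claimed for that step.

```shell
#!/bin/sh
# Recreate the four-line test file from the report (the temporary path is arbitrary).
tmp=$(mktemp)
printf 'abc Z\nab Z\nabcd Z\nabce Z\n' > "$tmp"

# Whole-line sort by byte value: space (0x20) sorts before every letter,
# so the shorter words come first.
c_sorted=$(LC_ALL=C sort "$tmp")
printf 'LC_ALL=C, whole line:\n%s\n\n' "$c_sorted"

# Whole-line sort under en_US.UTF-8. If this locale is not installed the
# result may simply match the C order, so it is printed but not checked.
utf8_sorted=$(LC_ALL=en_US.UTF-8 sort "$tmp")
printf 'LC_ALL=en_US.UTF-8, whole line:\n%s\n\n' "$utf8_sorted"

# Field-wise sort as suggested in the reply: compare field 1, then field 2,
# skipping leading blanks (-b) and disabling the last-resort comparison (-s).
key_sorted=$(LC_ALL=en_US.UTF-8 sort -s -b -k1,1 -k2,2 "$tmp")
printf 'LC_ALL=en_US.UTF-8, -k1,1 -k2,2:\n%s\n' "$key_sorted"

rm -f "$tmp"
```

Restricting the comparison to explicit fields keeps the space out of the collated key, which is why the field-wise run gives the byte-order result while still using the locale's rules inside each field.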