From unknown Fri Sep 05 08:19:44 2025
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-Mailer: MIME-tools 5.509 (Entity 5.509)
Content-Type: text/plain; charset=utf-8
From: bug#6327 <6327@debbugs.gnu.org>
To: bug#6327 <6327@debbugs.gnu.org>
Subject: Status: sort fails on some UTF-8 input
Reply-To: bug#6327 <6327@debbugs.gnu.org>
Date: Fri, 05 Sep 2025 15:19:44 +0000

retitle 6327 sort fails on some UTF-8 input
reassign 6327 coreutils
submitter 6327 River Tarnell
severity 6327 normal
tag 6327 notabug
thanks

From debbugs-submit-bounces@debbugs.gnu.org Wed Jun 02 03:39:07 2010
Received: (at submit) by debbugs.gnu.org; 2 Jun 2010 07:39:07 +0000
Received: from localhost ([127.0.0.1] helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1OJiXa-0007WR-Sc for submit@debbugs.gnu.org; Wed, 02 Jun 2010 03:39:07 -0400
Received: from mail.gnu.org ([199.232.76.166] helo=mx10.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1OJfvc-0006Lu-Ah for submit@debbugs.gnu.org; Wed, 02 Jun 2010 00:51:44 -0400
Received: from lists.gnu.org ([199.232.76.165]:37933) by monty-python.gnu.org with esmtps (TLS-1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.60) (envelope-from ) id 1OJfvY-0007kB-V4 for submit@debbugs.gnu.org; Wed, 02 Jun 2010 00:51:41 -0400
Received: from [140.186.70.92] (port=50626 helo=eggs.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1OJfvT-00084O-N7 for bug-coreutils@gnu.org; Wed, 02 Jun 2010 00:51:40 -0400
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on eggs.gnu.org
X-Spam-Level:
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,T_RP_MATCHES_RCVD autolearn=unavailable version=3.3.1
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.69) (envelope-from ) id 1OJfvS-0001qj-MJ for bug-coreutils@gnu.org; Wed, 02 Jun 2010 00:51:35 -0400
Received: from loreley.tcx.org.uk ([81.187.4.82]:51846) by eggs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id
1OJfvS-0001pt-H2 for bug-coreutils@gnu.org; Wed, 02 Jun 2010 00:51:34 -0400
Received: by LORELEY.TCX.ORG.UK (Postfix, from userid 106) id 5742522E20; Wed, 2 Jun 2010 05:51:25 +0100 (BST)
Date: Wed, 2 Jun 2010 05:51:25 +0100
From: River Tarnell
To: bug-coreutils@gnu.org
Subject: sort fails on some UTF-8 input
Message-ID: <20100602045125.GC28776@loreley.TCX.ORG.UK>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="tsOsTdHNUZQcU9Ye"
Content-Disposition: inline
User-Agent: Mutt/1.5.20 (2009-06-14)
Content-Transfer-Encoding: 7bit
X-detected-operating-system: by eggs.gnu.org: HP-UX 11.00-11.11
X-detected-operating-system: by monty-python.gnu.org: GNU/Linux 2.6, seldom 2.4 (older, 4)
X-Spam-Score: -4.0 (----)
X-Debbugs-Envelope-To: submit
X-Mailman-Approved-At: Wed, 02 Jun 2010 03:39:05 -0400
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.11
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Sender: debbugs-submit-bounces@debbugs.gnu.org
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
X-Spam-Score: -5.3 (-----)

--tsOsTdHNUZQcU9Ye
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

I'm using coreutils 8.5 on Solaris 10.

GNU 'sort' fails to sort some input, while Solaris 'sort' handles it
correctly:

willow% /opt/ts/gnu/bin/sort sort_test.txt
/opt/ts/gnu/bin/sort: string comparison failed: Illegal byte sequence
/opt/ts/gnu/bin/sort: Set LC_ALL='C' to work around the problem.
/opt/ts/gnu/bin/sort: The strings compared were `\360\222\203\276\360\222\205\226' and `\360\222\200\255\360\222\213\253\360\222\213\253\360\222\200\255'.
willow% /usr/bin/sort sort_test.txt
𒃾𒅖
𒀭𒋫𒋫𒀭
willow%

I've attached the example file sort_test.txt.

	- river.

--tsOsTdHNUZQcU9Ye
Content-Type: text/plain; charset=utf-8
Content-Disposition: attachment; filename="sort_test.txt"
Content-Transfer-Encoding: quoted-printable

𒃾𒅖
𒀭𒋫𒋫𒀭

--tsOsTdHNUZQcU9Ye--

From debbugs-submit-bounces@debbugs.gnu.org Wed Jun 02 10:40:57 2010
Received: (at 6327) by debbugs.gnu.org; 2 Jun 2010 14:40:57 +0000
Received: from localhost ([127.0.0.1] helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1OJp7o-0003Nn-PP for submit@debbugs.gnu.org; Wed, 02 Jun 2010 10:40:57 -0400
Received: from mx1.redhat.com ([209.132.183.28]) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1OJp7m-0003Ni-MN for 6327@debbugs.gnu.org; Wed, 02 Jun 2010 10:40:56 -0400
Received: from int-mx05.intmail.prod.int.phx2.redhat.com (int-mx05.intmail.prod.int.phx2.redhat.com [10.5.11.18]) by mx1.redhat.com (8.13.8/8.13.8) with ESMTP id o52EemvO030936 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Wed, 2 Jun 2010 10:40:48 -0400
Received: from [10.3.227.83] (vpn-227-83.phx2.redhat.com [10.3.227.83]) by int-mx05.intmail.prod.int.phx2.redhat.com (8.13.8/8.13.8) with ESMTP id o52Eel8f023242; Wed, 2 Jun 2010 10:40:47 -0400
Message-ID: <4C066D53.5000800@redhat.com>
Date: Wed, 02 Jun 2010 08:40:19 -0600
From: Eric Blake
Organization: Red Hat
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.9) Gecko/20100430 Fedora/3.0.4-3.fc13 Lightning/1.0b2pre Mnenhy/0.8.2 Thunderbird/3.0.4
MIME-Version: 1.0
To: River Tarnell
Subject: Re: bug#6327: sort fails on some UTF-8 input
References: <20100602045125.GC28776@loreley.TCX.ORG.UK>
In-Reply-To: <20100602045125.GC28776@loreley.TCX.ORG.UK>
X-Enigmail-Version: 1.0.1
Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="------------enig8033657C6ED75899809EE6ED"
X-Scanned-By: MIMEDefang 2.67 on 10.5.11.18
X-Spam-Score: -8.8 (--------)
X-Debbugs-Envelope-To: 6327
Cc:
6327@debbugs.gnu.org, bug-gnulib
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.11
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Sender: debbugs-submit-bounces@debbugs.gnu.org
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
X-Spam-Score: -10.1 (----------)

This is an OpenPGP/MIME signed message (RFC 2440 and 3156)
--------------enig8033657C6ED75899809EE6ED
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

[adding gnulib]

On 06/01/2010 10:51 PM, River Tarnell wrote:
> I'm using coreutils 8.5 on Solaris 10.
>
> GNU 'sort' fails to sort some input, while Solaris 'sort' handles it
> correctly:
>
> willow% /opt/ts/gnu/bin/sort sort_test.txt
> /opt/ts/gnu/bin/sort: string comparison failed: Illegal byte sequence
> /opt/ts/gnu/bin/sort: Set LC_ALL='C' to work around the problem.
> /opt/ts/gnu/bin/sort: The strings compared were
> `\360\222\203\276\360\222\205\226' and
> `\360\222\200\255\360\222\213\253\360\222\213\253\360\222\200\255'.

Thanks for the report. What locale are you using (that is, the entire
output of 'locale')? I could not reproduce failure using:

$ export LC_ALL; for f in $(locale -a); do LC_ALL=$f || continue; sort sort_test.txt >/dev/null || { echo $f; break; }; done

on a GNU/Linux system with 732 installed locales. But it is highly
likely that you could be in a non-UTF-8 locale, or that the Solaris
multibyte functions are not as robust as glibc at detecting valid UTF-8
sequences. If it is indeed a bug in Solaris strcoll(), then gnulib can
probably be taught to work around it.

-- 
Eric Blake eblake@redhat.com +1-801-349-2682
Libvirt virtualization library http://libvirt.org

--------------enig8033657C6ED75899809EE6ED
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
Comment: Public key at http://people.redhat.com/eblake/eblake.gpg
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/

iQEcBAEBCAAGBQJMBm1TAAoJEKeha0olJ0NqvAEH/21yWVWa5alI7Ac9xJdjY1e/
r/aN0BBmBZaNhftZT9pEblrxWC74kavFG4maVmYpi1b1PBD4Zo2Kh1MfiVNXkGhR
8+XwNcfXMrcMB980/nUaIpGMDPI0j3BLlSybPOKuUtsLtmIaa7Y8NtC47TuQITgX
nE37nf5be/Jrl5tnEmqQ8FMX7dzDSzAz425NYhpHHABT41MB/iOcIJfacMp7RHIV
GezvqOkpVh5a78Z1yZ6CAnIeHnURzWZ9IEHavrL8M4kfBxDcwEM0owM2jr0LXo6Z
C2TI+zjp+wM2tKyQg2d/MUQLO8zOdjL4NwwJLby9Yh6IT6jFb1SKqlHCqRR/xCU=
=cd7f
-----END PGP SIGNATURE-----

--------------enig8033657C6ED75899809EE6ED--

From debbugs-submit-bounces@debbugs.gnu.org Wed Jun 02 11:32:10 2010
Received: (at 6327) by debbugs.gnu.org; 2 Jun 2010 15:32:10 +0000
Received: from localhost ([127.0.0.1] helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1OJpvO-0003jg-46 for submit@debbugs.gnu.org; Wed, 02 Jun 2010 11:32:10 -0400
Received: from mail1.slb.deg.dub.stisp.net ([84.203.253.98]) by debbugs.gnu.org with smtp (Exim 4.69) (envelope-from ) id 1OJpvM-0003ja-5j for 6327@debbugs.gnu.org; Wed, 02 Jun 2010 11:32:08 -0400
Received: (qmail 70818 invoked from network); 2 Jun 2010 15:32:02 -0000
Received: from unknown (HELO ?192.168.2.25?)
(84.203.137.218) by mail1.slb.deg.dub.stisp.net with SMTP; 2 Jun 2010 15:32:02 -0000
Message-ID: <4C067968.7040102@draigBrady.com>
Date: Wed, 02 Jun 2010 16:31:52 +0100
From: Pádraig Brady
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.8) Gecko/20100227 Thunderbird/3.0.3
MIME-Version: 1.0
To: River Tarnell
Subject: Re: bug#6327: sort fails on some UTF-8 input
References: <20100602045125.GC28776@loreley.TCX.ORG.UK>
In-Reply-To: <20100602045125.GC28776@loreley.TCX.ORG.UK>
X-Enigmail-Version: 1.0.1
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Spam-Score: -2.8 (--)
X-Debbugs-Envelope-To: 6327
Cc: 6327@debbugs.gnu.org
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.11
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Sender: debbugs-submit-bounces@debbugs.gnu.org
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
X-Spam-Score: -2.8 (--)

On 02/06/10 05:51, River Tarnell wrote:
> I'm using coreutils 8.5 on Solaris 10.
>
> GNU 'sort' fails to sort some input, while Solaris 'sort' handles it
> correctly:
>
> willow% /opt/ts/gnu/bin/sort sort_test.txt
> /opt/ts/gnu/bin/sort: string comparison failed: Illegal byte sequence
> /opt/ts/gnu/bin/sort: Set LC_ALL='C' to work around the problem.
> /opt/ts/gnu/bin/sort: The strings compared were
> `\360\222\203\276\360\222\205\226' and
> `\360\222\200\255\360\222\213\253\360\222\213\253\360\222\200\255'.
> willow% /usr/bin/sort sort_test.txt
> 𒃾𒅖
> 𒀭𒋫𒋫𒀭
> willow%
>
> I've attached the example file sort_test.txt.

I'm not sure what those characters are, but they're valid UTF-8
and my Linux system here has no issue with sorting them.
Note we just use strcoll() to do the comparison.
What strcoll() are you linking against?

cheers,
Pádraig.

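For readers following the thread, the strcoll() check discussed above can be sketched in C. This is only an illustration under stated assumptions, not the code in coreutils: the helper name checked_strcoll is hypothetical. strcoll() reserves no return value for errors, so POSIX suggests clearing errno before the call and inspecting it afterwards.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical helper (not taken from coreutils): compare A and B
   with strcoll() in the current locale.  Stores the comparison
   result in *RESULT and returns 0 on success, or the errno value
   left behind by strcoll() if it reported a failure.  */
static int
checked_strcoll (const char *a, const char *b, int *result)
{
  assert (a != NULL && b != NULL && result != NULL);

  /* No return value of strcoll() signals an error, so clear errno
     first and look at it once the call has returned.  */
  errno = 0;
  *result = strcoll (a, b);
  return errno;
}
```

A caller would normally run setlocale (LC_ALL, "") first so that the locale from the environment applies, then pass the two strings from the report. A nonzero return such as EILSEQ would correspond to the "Illegal byte sequence" diagnostic quoted in the report.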
From debbugs-submit-bounces@debbugs.gnu.org Wed Jun 02 15:38:08 2010
Received: (at 6327) by debbugs.gnu.org; 2 Jun 2010 19:38:08 +0000
Received: from localhost ([127.0.0.1] helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1OJtlP-0005VZ-Rv for submit@debbugs.gnu.org; Wed, 02 Jun 2010 15:38:08 -0400
Received: from kiwi.cs.ucla.edu ([131.179.128.19]) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1OJtlN-0005VC-3P for 6327@debbugs.gnu.org; Wed, 02 Jun 2010 15:38:06 -0400
Received: from [131.179.64.200] (Penguin.CS.UCLA.EDU [131.179.64.200]) by kiwi.cs.ucla.edu (8.13.8+Sun/8.13.8/UCLACS-6.0) with ESMTP id o52Jbwpx003485; Wed, 2 Jun 2010 12:37:59 -0700 (PDT)
Message-ID: <4C06B316.6060804@cs.ucla.edu>
Date: Wed, 02 Jun 2010 12:37:58 -0700
From: Paul Eggert
Organization: UCLA Computer Science Department
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.9) Gecko/20100423 Thunderbird/3.0.4
MIME-Version: 1.0
To: River Tarnell
Subject: Re: bug#6327: sort fails on some UTF-8 input
References: <20100602045125.GC28776@loreley.TCX.ORG.UK>
In-Reply-To: <20100602045125.GC28776@loreley.TCX.ORG.UK>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-Spam-Score: -2.4 (--)
X-Debbugs-Envelope-To: 6327
Cc: 6327@debbugs.gnu.org
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.11
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Sender: debbugs-submit-bounces@debbugs.gnu.org
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
X-Spam-Score: -2.5 (--)

On 06/01/2010 09:51 PM, River Tarnell wrote:
> I'm using coreutils 8.5 on Solaris 10.
>
> GNU 'sort' fails to sort some input, while Solaris 'sort' handles it
> correctly:

Amusingly enough, on that same test case I found the same problem
with GNU 'sort' that you did, but I also found that Solaris 'sort'
reports that it runs out of memory, even in 64-bit mode.
For example:

1010-kiwi $ LC_ALL=en_CA.UTF-8 /usr/bin/sparcv9/sort sort_test.txt
sort: insufficient memory; use -S option to increase allocation
1011-kiwi $ LC_ALL=en_CA.UTF-8 coreutils-8.5/src/sort sort_test.txt
coreutils-8.5/src/sort: string comparison failed: Illegal byte sequence
coreutils-8.5/src/sort: Set LC_ALL='C' to work around the problem.
coreutils-8.5/src/sort: The strings compared were `\360\222\203\276\360\222\205\226' and `\360\222\200\255\360\222\213\253\360\222\213\253\360\222\200\255'.

I expect that the exact failure mode probably depends on the locale
(and/or whether you're using x86 or sparc), and that GNU 'sort'
checks for strcoll failures but Solaris 'sort' does not (and thus
crashes).

If my guess is right, this appears to be a bug in the Solaris
strcoll implementation. I don't see a simple workaround.
You might file a bug report with Sun.

From debbugs-submit-bounces@debbugs.gnu.org Mon Aug 08 02:29:53 2011
Received: (at control) by debbugs.gnu.org; 8 Aug 2011 06:29:53 +0000
Received: from localhost ([127.0.0.1] helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1QqJLQ-0000Jm-IV for submit@debbugs.gnu.org; Mon, 08 Aug 2011 02:29:53 -0400
Received: from mx.meyering.net ([82.230.74.64]) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1QqJLN-0000Jd-TI for control@debbugs.gnu.org; Mon, 08 Aug 2011 02:29:46 -0400
Received: from rho.meyering.net (localhost.localdomain [127.0.0.1]) by rho.meyering.net (Acme Bit-Twister) with ESMTP id 4411D60098 for ; Mon, 8 Aug 2011 08:28:42 +0200 (CEST)
From: Jim Meyering
To: control@debbugs.gnu.org
Subject: notabug
Date: Mon, 08 Aug 2011 08:28:42 +0200
Message-ID: <87sjpce7s5.fsf@rho.meyering.net>
Lines: 2
MIME-Version: 1.0
Content-Type: text/plain
X-Spam-Score: -6.1 (------)
X-Debbugs-Envelope-To: control
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.11
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Sender: debbugs-submit-bounces@debbugs.gnu.org
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
X-Spam-Score: -6.1 (------)

tags 6327 + notabug

From debbugs-submit-bounces@debbugs.gnu.org Mon Aug 08 02:29:02 2011
Received: (at 6327-done) by debbugs.gnu.org; 8 Aug 2011 06:29:02 +0000
Received: from localhost ([127.0.0.1] helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1QqJKf-0000If-Hs for submit@debbugs.gnu.org; Mon, 08 Aug 2011 02:29:02 -0400
Received: from mx.meyering.net ([82.230.74.64]) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1QqJKc-0000IP-8r for 6327-done@debbugs.gnu.org; Mon, 08 Aug 2011 02:28:59 -0400
Received: from rho.meyering.net (localhost.localdomain [127.0.0.1]) by rho.meyering.net (Acme Bit-Twister) with ESMTP id E14A560098; Mon, 8 Aug 2011 08:27:56 +0200 (CEST)
From: Jim Meyering
To: River Tarnell
Subject: Re: bug#6327: sort fails on some UTF-8 input
In-Reply-To: <20100602045125.GC28776@loreley.TCX.ORG.UK> (River Tarnell's message of "Wed, 2 Jun 2010 05:51:25 +0100")
References: <20100602045125.GC28776@loreley.TCX.ORG.UK>
Date: Mon, 08 Aug 2011 08:27:56 +0200
Message-ID: <87y5z4e7tf.fsf@rho.meyering.net>
Lines: 25
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: -6.1 (------)
X-Debbugs-Envelope-To: 6327-done
Cc: bug-gnulib@gnu.org, 6327-done@debbugs.gnu.org
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.11
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Sender: debbugs-submit-bounces@debbugs.gnu.org
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
X-Spam-Score: -6.1 (------)

River Tarnell wrote:
> I'm using coreutils 8.5 on Solaris 10.
>
> GNU 'sort' fails to sort some input, while Solaris 'sort' handles it
> correctly:
>
> willow% /opt/ts/gnu/bin/sort sort_test.txt
> /opt/ts/gnu/bin/sort: string comparison failed: Illegal byte sequence
> /opt/ts/gnu/bin/sort: Set LC_ALL='C' to work around the problem.
> /opt/ts/gnu/bin/sort: The strings compared were
> `\360\222\203\276\360\222\205\226' and
> `\360\222\200\255\360\222\213\253\360\222\213\253\360\222\200\255'.
> willow% /usr/bin/sort sort_test.txt
> 𒃾𒅖
> 𒀭𒋫𒋫𒀭
> willow%
>
> I've attached the example file sort_test.txt.

Thanks for the report.
Since this appears not to be due to any problem with GNU sort per se,
but rather with the Solaris strcoll implementation, I'm closing this
coreutils "issue" and Cc'ing bug-gnulib, in case someone there wants
to pursue the strcoll-replacement approach.

From unknown Fri Sep 05 08:19:44 2025
Received: (at fakecontrol) by fakecontrolmessage;
To: internal_control@debbugs.gnu.org
From: Debbugs Internal Request
Subject: Internal Control
Message-Id: bug archived.
Date: Mon, 05 Sep 2011 11:24:04 +0000
User-Agent: Fakemail v42.6.9

# This is a fake control message.
#
# The action:
# bug archived.
thanks
# This fakemail brought to you by your local debbugs
# administrator