GNU bug report logs - #6327
sort fails on some UTF-8 input

Package: coreutils

Reported by: River Tarnell <river.tarnell <at> wikimedia.de>

Date: Wed, 2 Jun 2010 07:40:03 UTC

Severity: normal

Tags: notabug

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 6327 in the body.
You can then email your comments to 6327 AT debbugs.gnu.org in the normal way.


Report forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6327; Package coreutils. (Wed, 02 Jun 2010 07:40:03 GMT)

Acknowledgement sent to River Tarnell <river.tarnell <at> wikimedia.de>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 02 Jun 2010 07:40:03 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: River Tarnell <river.tarnell <at> wikimedia.de>
To: bug-coreutils <at> gnu.org
Subject: sort fails on some UTF-8 input
Date: Wed, 2 Jun 2010 05:51:25 +0100
I'm using coreutils 8.5 on Solaris 10.

GNU 'sort' fails to sort some input, while Solaris 'sort' handles it
correctly:

willow% /opt/ts/gnu/bin/sort sort_test.txt 
/opt/ts/gnu/bin/sort: string comparison failed: Illegal byte sequence
/opt/ts/gnu/bin/sort: Set LC_ALL='C' to work around the problem.
/opt/ts/gnu/bin/sort: The strings compared were
`\360\222\203\276\360\222\205\226' and
`\360\222\200\255\360\222\213\253\360\222\213\253\360\222\200\255'.
willow% /usr/bin/sort sort_test.txt 
𒃾𒅖
𒀭𒋫𒋫𒀭
willow% 

I've attached the example file sort_test.txt.

	- river.
[sort_test.txt (text/plain, attachment)]

Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6327; Package coreutils. (Wed, 02 Jun 2010 14:41:01 GMT)

Message #8 received at 6327 <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: River Tarnell <river.tarnell <at> wikimedia.de>
Cc: 6327 <at> debbugs.gnu.org, bug-gnulib <bug-gnulib <at> gnu.org>
Subject: Re: bug#6327: sort fails on some UTF-8 input
Date: Wed, 02 Jun 2010 08:40:19 -0600
[adding gnulib]

On 06/01/2010 10:51 PM, River Tarnell wrote:
> I'm using coreutils 8.5 on Solaris 10.
> 
> GNU 'sort' fails to sort some input, while Solaris 'sort' handles it
> correctly:
> 
> willow% /opt/ts/gnu/bin/sort sort_test.txt 
> /opt/ts/gnu/bin/sort: string comparison failed: Illegal byte sequence
> /opt/ts/gnu/bin/sort: Set LC_ALL='C' to work around the problem.
> /opt/ts/gnu/bin/sort: The strings compared were
> `\360\222\203\276\360\222\205\226' and
> `\360\222\200\255\360\222\213\253\360\222\213\253\360\222\200\255'.

Thanks for the report.  What locale are you using (that is, the entire
output of 'locale')?  I could not reproduce failure using:

$ export LC_ALL; for f in $(locale -a); do LC_ALL=$f || continue;
    sort sort_test.txt >/dev/null || { echo $f; break; }; done

on a GNU/Linux system with 732 installed locales.  But it is highly
likely that you could be in a non-UTF-8 locale, or that the Solaris
multibyte functions are not as robust as glibc at detecting valid UTF-8
sequences.  If it is indeed a bug in Solaris strcoll(), then gnulib can
probably be taught to work around it.

-- 
Eric Blake   eblake <at> redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org

Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6327; Package coreutils. (Wed, 02 Jun 2010 15:33:02 GMT)

Message #11 received at 6327 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: River Tarnell <river.tarnell <at> wikimedia.de>
Cc: 6327 <at> debbugs.gnu.org
Subject: Re: bug#6327: sort fails on some UTF-8 input
Date: Wed, 02 Jun 2010 16:31:52 +0100
On 02/06/10 05:51, River Tarnell wrote:
> I'm using coreutils 8.5 on Solaris 10.
> 
> GNU 'sort' fails to sort some input, while Solaris 'sort' handles it
> correctly:
> 
> willow% /opt/ts/gnu/bin/sort sort_test.txt 
> /opt/ts/gnu/bin/sort: string comparison failed: Illegal byte sequence
> /opt/ts/gnu/bin/sort: Set LC_ALL='C' to work around the problem.
> /opt/ts/gnu/bin/sort: The strings compared were
> `\360\222\203\276\360\222\205\226' and
> `\360\222\200\255\360\222\213\253\360\222\213\253\360\222\200\255'.
> willow% /usr/bin/sort sort_test.txt 
> 𒃾𒅖
> 𒀭𒋫𒋫𒀭
> willow% 
> 
> I've attached the example file sort_test.txt.

I'm not sure what those characters are, but they're valid UTF-8,
and my Linux system here has no issue sorting them.
Note we just use strcoll() to do the comparison.
What strcoll() are you linking against?

cheers,
Pádraig.
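
The claim that the reported strings are valid UTF-8 can be checked by
decoding the octal escapes from the error message directly. The following
is a minimal Python sketch; the helper name unescape_octal is made up for
illustration and is not part of coreutils.

```python
import re


def unescape_octal(text: str) -> bytes:
    """Turn a string with \\ooo octal escapes (as printed by sort) into bytes."""
    out = bytearray()
    pos = 0
    for m in re.finditer(r"\\([0-7]{3})", text):
        out += text[pos:m.start()].encode("ascii")  # literal text between escapes
        out.append(int(m.group(1), 8))              # one byte per escape
        pos = m.end()
    out += text[pos:].encode("ascii")
    return bytes(out)


first = unescape_octal(r"\360\222\203\276\360\222\205\226")
second = unescape_octal(
    r"\360\222\200\255\360\222\213\253\360\222\213\253\360\222\200\255"
)

for raw in (first, second):
    decoded = raw.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
    print(" ".join(f"U+{ord(ch):04X}" for ch in decoded))
```

Both strings decode without error, and every code point falls in the
Unicode Cuneiform block (U+12000 to U+123FF), which lies outside the Basic
Multilingual Plane and therefore needs four bytes per character in UTF-8.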




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6327; Package coreutils. (Wed, 02 Jun 2010 19:39:01 GMT)

Message #14 received at 6327 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> CS.UCLA.EDU>
To: River Tarnell <river.tarnell <at> wikimedia.de>
Cc: 6327 <at> debbugs.gnu.org
Subject: Re: bug#6327: sort fails on some UTF-8 input
Date: Wed, 02 Jun 2010 12:37:58 -0700
On 06/01/2010 09:51 PM, River Tarnell wrote:
> I'm using coreutils 8.5 on Solaris 10.
> 
> GNU 'sort' fails to sort some input, while Solaris 'sort' handles it
> correctly:

Amusingly enough, on that same test case I found the same problem
with GNU 'sort' that you did, but I also found that Solaris 'sort'
reports that it runs out of memory, even in 64-bit mode.  For example:

1010-kiwi $ LC_ALL=en_CA.UTF-8 /usr/bin/sparcv9/sort sort_test.txt 
sort: insufficient memory; use -S option to increase allocation
1011-kiwi $ LC_ALL=en_CA.UTF-8 coreutils-8.5/src/sort sort_test.txt
coreutils-8.5/src/sort: string comparison failed: Illegal byte sequence
coreutils-8.5/src/sort: Set LC_ALL='C' to work around the problem.
coreutils-8.5/src/sort: The strings compared were `\360\222\203\276\360\222\205\226' and `\360\222\200\255\360\222\213\253\360\222\213\253\360\222\200\255'.

I expect that the exact failure mode probably depends on the
locale (and/or whether you're using x86 or sparc),
and that GNU 'sort' checks for strcoll failures but
Solaris 'sort' does not (and thus crashes).  If my guess is right,
this appears to be a bug in the Solaris strcoll implementation.
I don't see a simple workaround.  You might file a bug report
with Sun.
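
The check described above (clear errno, call strcoll(), and treat a
nonzero errno afterwards as a failure) can be sketched roughly as follows.
This is an illustration under stated assumptions, not the actual coreutils
code: the helpers checked_strcoll and compare_in_locale are hypothetical,
and loading the C library with ctypes.CDLL(None) assumes a Unix-like
system.

```python
import ctypes
import locale
import os

# Assumption: a Unix-like system where CDLL(None) exposes the C library.
libc = ctypes.CDLL(None, use_errno=True)
libc.strcoll.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
libc.strcoll.restype = ctypes.c_int


def checked_strcoll(a: bytes, b: bytes) -> int:
    """Compare a and b with strcoll(); raise OSError if errno gets set."""
    ctypes.set_errno(0)  # strcoll() has no return value reserved for errors
    result = libc.strcoll(a, b)
    err = ctypes.get_errno()
    if err != 0:
        raise OSError(err, os.strerror(err))
    return result


def compare_in_locale(name: str, a: bytes, b: bytes) -> int:
    """Switch LC_COLLATE to the named locale, then compare a and b."""
    locale.setlocale(locale.LC_COLLATE, name)
    return checked_strcoll(a, b)


# The two strings from the report, as raw UTF-8 bytes.
first = b"\xf0\x92\x83\xbe\xf0\x92\x85\x96"
second = b"\xf0\x92\x80\xad\xf0\x92\x8b\xab\xf0\x92\x8b\xab\xf0\x92\x80\xad"

# The "C" locale always exists and compares bytewise; substitute a UTF-8
# locale name such as "en_CA.UTF-8" (if installed) to exercise collation.
try:
    print(compare_in_locale("C", first, second))
except OSError as exc:
    print("string comparison failed:", exc)
```

A caller that skips the errno check simply uses whatever value strcoll()
returned, so a failure inside the C library goes unnoticed at that point.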




Added tag(s) notabug. Request was from Jim Meyering <jim <at> meyering.net> to control <at> debbugs.gnu.org. (Mon, 08 Aug 2011 06:30:02 GMT)

Reply sent to Jim Meyering <jim <at> meyering.net>:
You have taken responsibility. (Mon, 08 Aug 2011 06:30:04 GMT)

Notification sent to River Tarnell <river.tarnell <at> wikimedia.de>:
bug acknowledged by developer. (Mon, 08 Aug 2011 06:30:04 GMT)

Message #21 received at 6327-done <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: River Tarnell <river.tarnell <at> wikimedia.de>
Cc: bug-gnulib <at> gnu.org, 6327-done <at> debbugs.gnu.org
Subject: Re: bug#6327: sort fails on some UTF-8 input
Date: Mon, 08 Aug 2011 08:27:56 +0200
River Tarnell wrote:
> I'm using coreutils 8.5 on Solaris 10.
>
> GNU 'sort' fails to sort some input, while Solaris 'sort' handles it
> correctly:
>
> willow% /opt/ts/gnu/bin/sort sort_test.txt
> /opt/ts/gnu/bin/sort: string comparison failed: Illegal byte sequence
> /opt/ts/gnu/bin/sort: Set LC_ALL='C' to work around the problem.
> /opt/ts/gnu/bin/sort: The strings compared were
> `\360\222\203\276\360\222\205\226' and
> `\360\222\200\255\360\222\213\253\360\222\213\253\360\222\200\255'.
> willow% /usr/bin/sort sort_test.txt
> 𒃾𒅖
> 𒀭𒋫𒋫𒀭
> willow%
>
> I've attached the example file sort_test.txt.

Thanks for the report.
Since this appears not to be due to any problem
with GNU sort per se, but rather with Solaris'
strcoll implementation, I'm closing this coreutils "issue"
and Cc'ing bug-gnulib, in case someone there wants to
pursue the strcoll-replacement approach.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 05 Sep 2011 11:24:04 GMT)

This bug report was last modified 13 years and 351 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.