#42986 - sort: possible bug when sorting special characters

GNU bug report logs - #42986
sort: possible bug when sorting special characters

Reported by: "Wolter H. V." <wolterhv <at> gmx.de>

Date: Sat, 22 Aug 2020 15:38:02 UTC

Severity: normal

Tags: notabug

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.

Message #8 received at 42986 <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: "Wolter H. V." <wolterhv <at> gmx.de>, 42986 <at> debbugs.gnu.org
Subject: Re: bug#42986: sort: possible bug when sorting special characters
Date: Sat, 22 Aug 2020 10:51:23 -0500

tag 42986 notabug
thanks

On 8/22/20 6:46 AM, Wolter H. V. wrote:
> The following commands:
>
> echo 'Pará,9\nParacito,0' | sort --field-separator=, -k1

Use of echo with \ is non-portable, more portable is to use printf.

>
> and
>
> echo 'Pará,Z\nParacito,A' | sort --field-separator=, -k1

Using -k1 (rather than -k1,1) says to use the entire remainder of the
line in the sort field comparison. Furthermore, sorting is locale
dependent, and some locales treat punctuation as insignificant in the
collation process. You can see this yourself by using the --debug
option:

$ printf 'Pará,9\nParacito,0\n' | sort --field-separator=, -k1 --debug
sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
Pará,9
______
______
Paracito,0
__________
__________

In the en_US.UTF-8 locale, commas and accents are ignored, and since
you did not end the field at the first comma, you end up getting the
same sort as 'Para9' vs. 'Parac', where 9 sorts before c.

$ printf 'Pará,9\nParacito,0\n' | sort --field-separator=, -k1,1 --debug
sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
Pará,9
____
______
Paracito,0
________
__________

In the same locale, but using a more limited field, you now have two
prefixes 'Para' that compare identically, so the shorter string sorts
first.

$ printf 'Pará,9\nParacito,0\n' | LC_ALL=C sort --field-separator=, -k1 --debug
sort: text ordering performed using simple byte comparison
Paracito,0
__________
__________
Pará,9
_______
_______

In the C locale, every byte sorts distinct, so accents become
important, and 'a' sorts before 'á'.

>
> give
>
> Pará,9
> Paracito,0
>
> and
>
> Paracito,A
> Pará,Z
>
> respectively.

$ printf 'Pará,Z\nParacito,A\n' | sort --field-separator=, -k1,1 --debug
sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
Pará,Z
____
______
Paracito,A
________
__________

Forcing the shorter sort field by using -k1,1 gets the results you seem
to be looking for.
>
> Sorting the string 'á\na' results in 'a\ná', so I would expect the
> commands above to put Paracito before Pará, but this is not the case
> for the first command. Why is that?

Rather, you were probably sorting in a locale where 'a' and 'á' collate
identically, to the point where the tie was broken by a later point in
the line.

At any rate, since sort is behaving as required by POSIX by honoring
your locale, and the --debug option lets you see what is going on, I
see nothing to fix, so I'm marking this as not a bug. However, feel
free to respond with further followups.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org
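
Editorial sketch (not part of the original message): the difference between -k1 and -k1,1 described in the reply can be reproduced without depending on which locales are installed. The example below forces the C locale, so comparison is byte by byte. The two input lines are hypothetical, chosen only so that the separator itself decides the order when the key runs to the end of the line. It uses -t, the short form of --field-separator.

```shell
# Hypothetical two-line input (not from the bug report), chosen so that
# the two key specifications disagree even under plain byte comparison.
input='a,z
a!,a'

# -k1 with no end field: the key runs from field 1 to the end of the line,
# so the separator takes part in the comparison ("a,z" vs "a!,a").
# '!' (0x21) is lower than ',' (0x2C), so "a!,a" comes first.
whole_line=$(printf '%s\n' "$input" | LC_ALL=C sort -t , -k1)

# -k1,1: the key is field 1 only ("a" vs "a!").
# "a" is a prefix of "a!", so the shorter key comes first.
first_field=$(printf '%s\n' "$input" | LC_ALL=C sort -t , -k1,1)

printf 'with -k1:\n%s\n\nwith -k1,1:\n%s\n' "$whole_line" "$first_field"
```

Under these assumptions the two invocations should print the lines in opposite orders, which is the same effect the --debug underlines show above: the key chosen with -k determines which characters are compared at all.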

This bug report was last modified 4 years and 329 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
