GNU bug report logs -
#42986
sort: possible bug when sorting special characters
Reported by: "Wolter H. V." <wolterhv <at> gmx.de>
Date: Sat, 22 Aug 2020 15:38:02 UTC
Severity: normal
Tags: notabug
Done: Eric Blake <eblake <at> redhat.com>
Bug is archived. No further changes may be made.
tag 42986 notabug
thanks
On 8/22/20 6:46 AM, Wolter H. V. wrote:
> The following commands:
>
> echo 'Pará,9\nParacito,0' | sort --field-separator=, -k1
Using echo with backslash escapes is non-portable; printf is the portable alternative.
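A minimal sketch of the portability point, assuming a POSIX sh: printf expands \n in its format string in every conforming shell, while echo's handling of backslashes varies (bash's builtin prints them literally by default, dash's expands them). The variable names are only for illustration.

```shell
# printf always turns \n in the format string into a newline.
input=$(printf 'Pará,9\nParacito,0\n')
# Count the records to confirm that two separate lines reach the pipe.
line_count=$(printf '%s\n' "$input" | wc -l | tr -d ' ')
echo "$line_count"
```

If the count is 2, sort will see two records rather than one line containing a literal backslash and 'n'.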
>
> and
>
> echo 'Pará,Z\nParacito,A' | sort --field-separator=, -k1
Using -k1 (rather than -k1,1) says to use the entire remainder of the
line in the sort field comparison. Furthermore, sorting is locale
dependent, and some locales treat punctuation as insignificant in the
collation process. You can see this yourself by using the --debug option:
$ printf 'Pará,9\nParacito,0\n' | sort --field-separator=, -k1 --debug
sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
Pará,9
______
______
Paracito,0
__________
__________
In the en_US.UTF-8 locale, commas and accents are ignored, and since you
did not end the field at the first comma, you end up getting the same
sort as 'Para9' vs. 'Parac', where 9 sorts before c.
$ printf 'Pará,9\nParacito,0\n' | sort --field-separator=, -k1,1 --debug
sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
Pará,9
____
______
Paracito,0
________
__________
In the same locale, but using a more limited field, you now have two
prefixes 'Para' that compare identically, so the shorter string sorts first.
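The difference between -k1 and -k1,1 can also be seen without any locale effects. The following is only a sketch: the sample data is hypothetical (not from the report) and LC_ALL=C is forced so that the comparison is plain byte order.

```shell
# Hypothetical sample data: ',' is byte 0x2c and ' ' is byte 0x20.
data='Para,9
Para Norte,0'
# -k1,1 compares only "Para" vs "Para Norte"; the shorter key sorts first.
first_field=$(printf '%s\n' "$data" | LC_ALL=C sort -t, -k1,1)
# -k1 compares "Para,9" vs "Para Norte,0"; the space sorts before the comma.
to_end=$(printf '%s\n' "$data" | LC_ALL=C sort -t, -k1)
printf '%s\n---\n%s\n' "$first_field" "$to_end"
```

With -k1,1 the input order is kept; with -k1 the two lines swap, because the separator and the second field take part in the comparison.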
$ printf 'Pará,9\nParacito,0\n' | LC_ALL=C sort --field-separator=, -k1 --debug
sort: text ordering performed using simple byte comparison
Paracito,0
__________
__________
Pará,9
_______
_______
In the C locale, every byte sorts distinctly, so accents become significant,
and 'a' sorts before 'á'.
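A rough way to check the byte-order explanation, assuming the input is UTF-8 encoded (which the seven-column underline for 'Pará,9' in the debug output suggests): 'á' is then the two bytes c3 a1, both larger than any ASCII letter. The variable names are illustrative only.

```shell
# In the C locale, keys are compared byte by byte.
sorted_c=$(printf 'Pará,9\nParacito,0\n' | LC_ALL=C sort -t, -k1,1)
printf '%s\n' "$sorted_c"
# Dump the bytes of 'á'; in UTF-8 they are c3 a1, above 'a' (61).
accent_bytes=$(printf 'á' | od -An -tx1 | tr -d ' \n')
echo "$accent_bytes"
```

Since byte 61 ('a') is smaller than c3, 'Paracito' comes out ahead of 'Pará' here.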
>
> give
>
> Pará,9
> Paracito,0
>
> and
>
> Paracito,A
> Pará,Z
>
> respectively.
$ printf 'Pará,Z\nParacito,A\n' | sort --field-separator=, -k1,1 --debug
sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
Pará,Z
____
______
Paracito,A
________
__________
Forcing the shorter sort field by using -k1,1 gets the results you seem
to be looking for.
>
> Sorting the string 'á\na' results in 'a\ná', so I would expect the commands above to put Paracito before Pará, but this is not the case for the first command. Why is that?
Rather, you were probably sorting in a locale where 'a' and 'á' collate
identically, so the tie was broken by a later character in the line.
At any rate, since sort is behaving as required by POSIX by honoring
your locale, and the --debug option lets you see what is going on, I see
nothing to fix, so I'm marking this as not a bug. However, feel free to
respond with further followups.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org
This bug report was last modified 4 years and 274 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.