GNU bug report logs - #42986
sort: possible bug when sorting special characters


Package: coreutils

Reported by: "Wolter H. V." <wolterhv <at> gmx.de>

Date: Sat, 22 Aug 2020 15:38:02 UTC

Severity: normal

Tags: notabug

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 42986 in the body.
You can then email your comments to 42986 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-coreutils <at> gnu.org:
bug#42986; Package coreutils. (Sat, 22 Aug 2020 15:38:02 GMT)

Acknowledgement sent to "Wolter H. V." <wolterhv <at> gmx.de>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Sat, 22 Aug 2020 15:38:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: "Wolter H. V." <wolterhv <at> gmx.de>
To: bug-coreutils <at> gnu.org
Subject: sort: possible bug when sorting special characters
Date: Sat, 22 Aug 2020 12:46:19 +0100

The following commands:

    echo 'Pará,9\nParacito,0' | sort --field-separator=, -k1

and

    echo 'Pará,Z\nParacito,A' | sort --field-separator=, -k1

give

    Pará,9
    Paracito,0

and

    Paracito,A
    Pará,Z

respectively.

Sorting the string 'á\na' results in 'a\ná', so I would expect the commands above to put Paracito before Pará, but this is not the case for the first command. Why is that?

Regards,

Wolter HV





Information forwarded to bug-coreutils <at> gnu.org:
bug#42986; Package coreutils. (Sat, 22 Aug 2020 15:52:01 GMT)

Message #8 received at 42986 <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: "Wolter H. V." <wolterhv <at> gmx.de>, 42986 <at> debbugs.gnu.org
Subject: Re: bug#42986: sort: possible bug when sorting special characters
Date: Sat, 22 Aug 2020 10:51:23 -0500

tag 42986 notabug
thanks

On 8/22/20 6:46 AM, Wolter H. V. wrote:
> The following commands:
> 
>      echo 'Pará,9\nParacito,0' | sort --field-separator=, -k1

Using echo with backslash escapes is non-portable; printf is the more
portable choice.
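
As a minimal sketch of the portable form (the variable name `input` is only illustrative), printf with a `%s\n` format emits each argument on its own line without relying on how a given shell's echo treats backslashes:

```shell
# Build the two-line test input portably; each argument becomes one line.
input=$(printf '%s\n' 'Pará,9' 'Paracito,0')

# Print it back, e.g. to pipe into sort.
printf '%s\n' "$input"
```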

> 
> and
> 
>      echo 'Pará,Z\nParacito,A' | sort --field-separator=, -k1

Using -k1 (rather than -k1,1) says to use the entire remainder of the 
line in the sort field comparison.  Furthermore, sorting is locale 
dependent, and some locales treat punctuation as insignificant in the 
collation process.  You can see this yourself by using the --debug option:

$ printf 'Pará,9\nParacito,0\n' | sort --field-separator=, -k1 --debug
sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
Pará,9
______
______
Paracito,0
__________
__________

In the en_US.UTF-8 locale, commas and accents are ignored, and since you 
did not end the field at the first comma, you end up getting the same 
sort as 'Para9' vs. 'Parac', where 9 sorts before c.
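
One rough way to picture that comparison, assuming the commas and the accent are simply stripped by hand (the locale's real collation uses multi-level weights, so this is only an approximation), is to sort the stripped keys bytewise:

```shell
# Hand-stripped stand-ins for 'Pará,9' and 'Paracito,0' (an illustrative
# assumption, not what sort does internally).
stripped_first=$(printf 'Paracito0\nPara9\n' | LC_ALL=C sort | head -n 1)

# '9' (0x39) is lower than 'c' (0x63), so the 'Para9' key comes first.
printf '%s\n' "$stripped_first"
```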


$ printf 'Pará,9\nParacito,0\n' | sort --field-separator=, -k1,1 --debug
sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
Pará,9
____
______
Paracito,0
________
__________

In the same locale, but with the key limited to the first field, 'Pará' 
compares equal to the first four characters of 'Paracito', so the shorter 
key sorts first.
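
The difference between -k1 and -k1,1 can also be seen without any locale effects. The sketch below uses made-up ASCII data, forces the C locale, and adds -s (stable sort, which disables the last-resort whole-line comparison) so that only the key decides the order:

```shell
# Hypothetical two-line input whose first fields are identical.
data=$(printf 'Para,9\nPara,0\n')

# -k1,1: the key is only 'Para' on both lines, so they tie and -s keeps
# the input order.
first_field_only=$(printf '%s\n' "$data" | LC_ALL=C sort -s -t, -k1,1 | head -n 1)

# -k1: the key runs to the end of the line, so ',0' vs ',9' decides.
to_end_of_line=$(printf '%s\n' "$data" | LC_ALL=C sort -s -t, -k1 | head -n 1)

printf '%s\n' "$first_field_only" "$to_end_of_line"
```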

$ printf 'Pará,9\nParacito,0\n' | LC_ALL=C sort --field-separator=, -k1 --debug
sort: text ordering performed using simple byte comparison
Paracito,0
__________
__________
Pará,9
_______
_______

In the C locale, bytes are compared by value, so accents become 
significant, and 'a' sorts before 'á'.
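
To see why, it can help to look at the bytes involved. This sketch assumes the input is UTF-8 and writes 'á' with octal escapes so it does not depend on the encoding of the script itself:

```shell
# 'á' in UTF-8 is the two bytes 0xC3 0xA1, both above every ASCII letter.
accent_bytes=$(printf '\303\241' | od -An -tx1 | tr -d ' \n')

# Bytewise comparison: at the fourth character, 'a' (0x61) < 0xC3,
# so 'Paracito' sorts ahead of 'Pará'.
c_first=$(printf 'Par\303\241,9\nParacito,0\n' | LC_ALL=C sort -t, -k1 | head -n 1)

printf '%s\n' "$accent_bytes" "$c_first"
```

The two-byte encoding is also why the --debug underline for 'Pará,9' above is seven characters wide rather than six.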

> 
> give
> 
>      Pará,9
>      Paracito,0
> 
> and
> 
>      Paracito,A
>      Pará,Z
> 
> respectively.

$ printf 'Pará,Z\nParacito,A\n' | sort --field-separator=, -k1,1 --debug
sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
Pará,Z
____
______
Paracito,A
________
__________

Forcing the shorter sort field by using -k1,1 gets the results you seem 
to be looking for.


> 
> Sorting the string 'á\na' results in 'a\ná', so I would expect the commands above to put Paracito before Pará, but this is not the case for the first command. Why is that?

Rather, you were probably sorting in a locale where 'a' and 'á' collate 
identically, to the point where the tie was broken by a later point in 
the line.
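
If it is unclear which locale a given invocation will collate with, a small helper along these lines may help. It is only a sketch: `effective_collate_locale` is a hypothetical name, and the final "C" fallback is an assumption, since POSIX leaves the default implementation-defined when none of the variables is set:

```shell
# Report which setting governs collation, following the POSIX precedence
# LC_ALL > LC_COLLATE > LANG; empty values are treated as unset.
effective_collate_locale() {
    printf '%s\n' "${LC_ALL:-${LC_COLLATE:-${LANG:-C}}}"
}

effective_collate_locale
```

The first line printed by sort --debug remains the authoritative answer for what sort itself used.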

At any rate, since sort is behaving as required by POSIX by honoring 
your locale, and the --debug option lets you see what is going on, I see 
nothing to fix, so I'm marking this as not a bug.  However, feel free to 
respond with further followups.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org





Added tag(s) notabug. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Sat, 22 Aug 2020 15:52:02 GMT)

bug closed, send any further explanations to 42986 <at> debbugs.gnu.org and "Wolter H. V." <wolterhv <at> gmx.de> Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Sat, 22 Aug 2020 15:52:02 GMT)

Information forwarded to bug-coreutils <at> gnu.org:
bug#42986; Package coreutils. (Sat, 22 Aug 2020 19:42:02 GMT)

Message #15 received at 42986 <at> debbugs.gnu.org:

From: "Wolter H. V." <wolterhv <at> gmx.de>
To: Eric Blake <eblake <at> redhat.com>, 42986 <at> debbugs.gnu.org
Subject: Re: bug#42986: sort: possible bug when sorting special characters
Date: Sat, 22 Aug 2020 16:59:23 +0100

Hello Eric,

Thank you very much for your reply. Indeed it doesn't look like a bug.
Thank you for the explanation!

Regards,

Wolter H. V.





bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 20 Sep 2020 11:24:09 GMT)

This bug report was last modified 4 years and 274 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.