GNU bug report logs - #19021
Possible bug in sort


Package: coreutils

Reported by: Ben Mendis <dragonwisard <at> gmail.com>

Date: Tue, 11 Nov 2014 16:42:02 UTC

Severity: normal

Tags: notabug

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 19021 in the body.
You can then email your comments to 19021 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-coreutils <at> gnu.org:
bug#19021; Package coreutils. (Tue, 11 Nov 2014 16:42:02 GMT)

Acknowledgement sent to Ben Mendis <dragonwisard <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Tue, 11 Nov 2014 16:42:03 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Ben Mendis <dragonwisard <at> gmail.com>
To: bug-coreutils <at> gnu.org
Subject: Possible bug in sort
Date: Tue, 11 Nov 2014 11:39:12 -0500
[Message part 1 (text/plain, inline)]
http://stackoverflow.com/questions/26869717/why-does-sort-seem-to-sort-a-field-incorrectly-based-on-the-presence-or-absenc

Data is here: https://gist.github.com/anonymous/2a7beb4871b25ae8f8b3

This results in line 7 being sorted incorrectly: sort -t , -k 1n < weird.csv

This produced the expected results: cut -f , -d 1-3 < weird.csv | sort -t ,
-k 1n

Using 'g' instead of 'n' also produces the expected results, but I'm not
clear on what the difference is between 'g' and 'n'.

Tested with sort 8.21 on Slackware64-current.

Added tag(s) notabug. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Tue, 11 Nov 2014 17:40:03 GMT)

Reply sent to Eric Blake <eblake <at> redhat.com>:
You have taken responsibility. (Tue, 11 Nov 2014 17:40:05 GMT)

Notification sent to Ben Mendis <dragonwisard <at> gmail.com>:
bug acknowledged by developer. (Tue, 11 Nov 2014 17:40:06 GMT)

Message #12 received at 19021-done <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: Ben Mendis <dragonwisard <at> gmail.com>, 19021-done <at> debbugs.gnu.org
Subject: Re: bug#19021: Possible bug in sort
Date: Tue, 11 Nov 2014 10:39:13 -0700
[Message part 1 (text/plain, inline)]
tag 19021 notabug
thanks

On 11/11/2014 09:39 AM, Ben Mendis wrote:
> http://stackoverflow.com/questions/26869717/why-does-sort-seem-to-sort-a-field-incorrectly-based-on-the-presence-or-absenc
> 
> Data is here: https://gist.github.com/anonymous/2a7beb4871b25ae8f8b3

Thanks for the report.  Rather than making us chase down links, why not
provide the information inline with your email?

> 
> This results in line 7 being sorted incorrectly: sort -t , -k 1n < weird.csv

Try using the --debug option to see what is really happening.  The bug
is NOT in sort (which correctly obeyed your locale rules and incorrect
command line), but in your command line (because you didn't tell sort
where to quit parsing numbers).

I'm going to distill it down to a smaller input that still expresses the
same "swapped" lines:

$ printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
 | sort -t, -k1n --debug
sort: using ‘en_US.UTF-8’ sorting rules
sort: key 1 is numeric and spans multiple fields
1,73,67,6
_________
_________
2,68,61,7
_________
_________
1,69,55,14
__________
__________
2,71,59,12
__________
__________

See what's happening? The -k1n argument says to start the key at field
1, but gives no end position, so the key runs to the end of the line and
parsing continues until the input is no longer numeric (even if that
goes into field 2 or beyond). Since commas are silently ignored in the
en_US.UTF-8 locale when parsing a number, sort is thus comparing values
such as 268617 and 1695514, and the sort was correct.
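
A minimal sketch of that comparison, assuming that deleting the commas
with tr is a fair stand-in for how -n reads the line when ',' is the
locale's thousands separator (the variable names are only illustrative):

```shell
# Same four sample lines as in the example above.
input='1,73,67,6
2,68,61,7
1,69,55,14
2,71,59,12'

# Assumption: removing the commas approximates the number that
# "sort -n" sees when ',' is treated as a thousands separator.
merged=$(printf '%s\n' "$input" | tr -d ',' | LC_ALL=C sort -n)

# Prints 173676, 268617, 1695514, 2715912 (one per line), which is the
# same line order as the "swapped" output above.
printf '%s\n' "$merged"
```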

Now, try telling sort that it must parse a numeric field, but must END
the parse at the end of the first field (if not sooner due to end of
number):

$ printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
 | sort -t, -k1,1n --debug
sort: using ‘en_US.UTF-8’ sorting rules
1,69,55,14
_
__________
1,73,67,6
_
_________
2,68,61,7
_
_________
2,71,59,12
_
__________

Or try using a locale where ',' is NOT part of a valid number:

$ printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
 | LC_ALL=C sort -t, -k1n --debug
sort: using simple byte comparison
sort: key 1 is numeric and spans multiple fields
1,69,55,14
_
__________
1,73,67,6
_
_________
2,68,61,7
_
_________
2,71,59,12
_
__________


> 
> This produced the expected results: cut -f , -d 1-3 < weird.csv | sort -t ,
> -k 1n

Actually, you mean 'cut -d, -f 1-3' (you transposed while transferring
from the stackoverflow site to your email).  But yeah, when you truncate
to a smaller number, you are comparing different values (17367 is less
than 26861).
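
A rough sketch of why the truncated input appears to sort correctly,
again assuming that stripping commas approximates what -n parses in a
locale where ',' is a thousands separator:

```shell
# Same four sample lines as in the earlier examples.
input='1,73,67,6
2,68,61,7
1,69,55,14
2,71,59,12'

# Keep only the first three fields, then drop the commas to see the
# merged numbers that would be compared.
truncated=$(printf '%s\n' "$input" | cut -d, -f1-3 | tr -d ',')
printf '%s\n' "$truncated"

# Numeric order of the merged values: 16955 17367 26861 27159,
# which happens to match ordering by the first field alone.
ordered=$(printf '%s\n' "$truncated" | LC_ALL=C sort -n)
printf '%s\n' "$ordered"
```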

> 
> Using 'g' instead of 'n' also produces the expected results, but I'm not
> clear on what the difference is between 'g' and 'n'.

-n is specified by POSIX as parsing a numeric string (digits with an
optional radix character and thousands separators) according to the
current locale's definition.  -g is a GNU extension, which says to parse
general floating point numbers.  Apparently, in the en_US.UTF-8 locale,
parsing floating point stops at the first comma, while -n parsing does not:

$ printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
 | sort -t, -k1g --debug
sort: using ‘en_US.UTF-8’ sorting rules
sort: key 1 is numeric and spans multiple fields
1,69,55,14
_
__________
1,73,67,6
_
_________
2,68,61,7
_
_________
2,71,59,12
_
__________

I don't know why libc chose to make strtoll() ignore commas while
strtold() does not, when not in the C locale.
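
Independent of the comma question, one difference between 'g' and 'n'
that is easy to reproduce is exponent notation. The sketch below assumes
GNU sort (for -g) and uses a made-up two-line sample, run in the C locale
so that locale rules do not interfere:

```shell
# Hypothetical sample: one value in exponent notation, one plain integer.
sample='1e3
20'

# -n stops parsing at the 'e', so "1e3" is read as 1 and sorts first.
as_n=$(printf '%s\n' "$sample" | LC_ALL=C sort -n)

# -g parses "1e3" as the floating point value 1000, so 20 sorts first.
as_g=$(printf '%s\n' "$sample" | LC_ALL=C sort -g)

printf 'with -n:\n%s\nwith -g:\n%s\n' "$as_n" "$as_g"
```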

But at any rate, I hope I've demonstrated that the bug was in your usage
and not in sort.  So I'm closing this bug, although you should feel free
to add further comments or questions.  You may also want to read the FAQ:
https://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021
[Hmm - we should update that FAQ to mention the --debug option]

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


Information forwarded to bug-coreutils <at> gnu.org:
bug#19021; Package coreutils. (Tue, 11 Nov 2014 18:29:01 GMT)

Message #15 received at 19021-done <at> debbugs.gnu.org:

From: Leslie S Satenstein <lsatenstein <at> yahoo.com>
To: Eric Blake <eblake <at> redhat.com>, Ben Mendis <dragonwisard <at> gmail.com>, 
 "19021-done <at> debbugs.gnu.org" <19021-done <at> debbugs.gnu.org>
Subject: Re: bug#19021: Possible bug in sort
Date: Tue, 11 Nov 2014 18:27:49 +0000 (UTC)
[Message part 1 (text/plain, inline)]
Why not have used sort -t ',' -k 1n ?

Regards,
Leslie
Mr. Leslie Satenstein
Montréal Québec, Canada


 

Information forwarded to bug-coreutils <at> gnu.org:
bug#19021; Package coreutils. (Tue, 11 Nov 2014 19:30:03 GMT)

Message #18 received at 19021-done <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: Leslie S Satenstein <lsatenstein <at> yahoo.com>,
 Ben Mendis <dragonwisard <at> gmail.com>,
 "19021-done <at> debbugs.gnu.org" <19021-done <at> debbugs.gnu.org>
Subject: Re: bug#19021: Possible bug in sort
Date: Tue, 11 Nov 2014 12:29:33 -0700
[Message part 1 (text/plain, inline)]
On 11/11/2014 11:27 AM, Leslie S Satenstein wrote:

[please don't top-post on technical lists - it makes it harder to figure
out what you are asking]

> Why not have used  sort  -t ',' -k 1n  ?

> 
>>
>> This results in line 7 being sorted incorrectly: sort -t , -k 1n < weird.csv

Are you asking the difference between:

sort -t , -k 1n
sort -t ',' -k 1n

If so, there's no difference.  The shell strips the '' quoting around ,
before invoking sort, so sort receives the same argv[] with either
spelling.
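
A small sketch of that point; show_args is a hypothetical helper
introduced here only to display the arguments after the shell has
processed the quoting:

```shell
# Hypothetical helper: print each received argument in brackets.
show_args() { printf '[%s]' "$@"; printf '\n'; }

unquoted=$(show_args sort -t , -k 1n)
quoted=$(show_args sort -t ',' -k 1n)

# Both lines print [sort][-t][,][-k][1n]
printf '%s\n%s\n' "$unquoted" "$quoted"
```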

But that has nothing to do with the bug report, where the answer is that
the caller should have been using:

sort -t , -k 1,1n

or

LC_ALL=C sort -t , -k 1n

or the combination:

LC_ALL=C sort -t , -k 1,1n
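
As a sketch, here is the combined form applied to the four sample lines
from earlier in the thread. The -s variant is an addition for
illustration, assuming GNU sort: it disables the last-resort whole-line
comparison, so lines with equal keys keep their input order.

```shell
# Same four sample lines as in the earlier examples.
input='1,73,67,6
2,68,61,7
1,69,55,14
2,71,59,12'

# Key limited to field 1, C locale: ties fall back to comparing the
# whole line byte by byte.
fixed=$(printf '%s\n' "$input" | LC_ALL=C sort -t , -k 1,1n)
printf '%s\n' "$fixed"

# With -s (stable), ties keep their original input order instead.
stable=$(printf '%s\n' "$input" | LC_ALL=C sort -s -t , -k 1,1n)
printf '%s\n' "$stable"
```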

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


Information forwarded to bug-coreutils <at> gnu.org:
bug#19021; Package coreutils. (Tue, 11 Nov 2014 20:08:02 GMT)

Message #21 received at 19021 <at> debbugs.gnu.org:

From: Ben Mendis <dragonwisard <at> gmail.com>
To: 19021 <at> debbugs.gnu.org
Subject: Re: bug#19021: closed (Re: bug#19021: Possible bug in sort)
Date: Tue, 11 Nov 2014 15:07:27 -0500
[Message part 1 (text/plain, inline)]
Thanks for the explanation. This solves my issue.


bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 10 Dec 2014 12:24:04 GMT)

This bug report was last modified 10 years and 190 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.