Package: coreutils
Reported by: Ben Mendis <dragonwisard <at> gmail.com>
Date: Tue, 11 Nov 2014 16:42:02 UTC
Severity: normal
Tags: notabug
Done: Eric Blake <eblake <at> redhat.com>
Bug is archived. No further changes may be made.
From: Ben Mendis <dragonwisard <at> gmail.com>
To: 19021 <at> debbugs.gnu.org
Subject: bug#19021: closed (Re: bug#19021: Possible bug in sort)
Date: Tue, 11 Nov 2014 15:07:27 -0500
Thanks for the explanation. This solves my issue.

On Tue, Nov 11, 2014 at 12:40 PM, GNU bug Tracking System <help-debbugs <at> gnu.org> wrote:

> Your bug report
>
> #19021: Possible bug in sort
>
> which was filed against the coreutils package, has been closed.
>
> The explanation is attached below, along with your original report.
> If you require more details, please reply to 19021 <at> debbugs.gnu.org.
>
> --
> 19021: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=19021
> GNU Bug Tracking System
> Contact help-debbugs <at> gnu.org with problems
>
>
> ---------- Forwarded message ----------
> From: Eric Blake <eblake <at> redhat.com>
> To: Ben Mendis <dragonwisard <at> gmail.com>, 19021-done <at> debbugs.gnu.org
> Cc:
> Date: Tue, 11 Nov 2014 10:39:13 -0700
> Subject: Re: bug#19021: Possible bug in sort
> tag 19021 notabug
> thanks
>
> On 11/11/2014 09:39 AM, Ben Mendis wrote:
> >
> > http://stackoverflow.com/questions/26869717/why-does-sort-seem-to-sort-a-field-incorrectly-based-on-the-presence-or-absenc
> >
> > Data is here: https://gist.github.com/anonymous/2a7beb4871b25ae8f8b3
>
> Thanks for the report. Rather than making us chase down links, why not
> provide the information inline with your email?
>
> >
> > This results in line 7 being sorted incorrectly: sort -t , -k 1n < weird.csv
>
> Try using the --debug option to see what is really happening. The bug
> is NOT in sort (which correctly obeyed your locale rules and incorrect
> command line), but in your command line (because you didn't tell sort
> where to quit parsing numbers).
>
> I'm going to distill it down to a smaller input that still expresses the
> same "swapped" lines:
>
> $ printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
>   | sort -t, -k1n --debug
> sort: using ‘en_US.UTF-8’ sorting rules
> sort: key 1 is numeric and spans multiple fields
> 1,73,67,6
> _________
> _________
> 2,68,61,7
> _________
> _________
> 1,69,55,14
> __________
> __________
> 2,71,59,12
> __________
> __________
>
> See what's happening? The -k1n argument says to start parsing at field
> 1, but continue parsing until either the input is no longer numeric or
> until the end of line is reached (even if it goes into field 2 or
> beyond). Since commas are silently ignored in the en_US.UTF-8 locale
> when parsing a number, sort is thus comparing the values 268617 and
> 1695514, and the sort was correct.
>
> Now, try telling sort that it must parse a numeric field, but must END
> the parse at the end of the first field (if not sooner due to end of
> number):
>
> $ printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
>   | sort -t, -k1,1n --debug
> sort: using ‘en_US.UTF-8’ sorting rules
> 1,69,55,14
> _
> __________
> 1,73,67,6
> _
> _________
> 2,68,61,7
> _
> _________
> 2,71,59,12
> _
> __________
>
> Or try using a locale where ',' is NOT part of a valid number:
>
> $ printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
>   | LC_ALL=C sort -t, -k1n --debug
> sort: using simple byte comparison
> sort: key 1 is numeric and spans multiple fields
> 1,69,55,14
> _
> __________
> 1,73,67,6
> _
> _________
> 2,68,61,7
> _
> _________
> 2,71,59,12
> _
> __________
>
> >
> >
> > This produced the expected results: cut -f , -d 1-3 < weird.csv | sort -t , -k 1n
>
> Actually, you mean 'cut -d, -f 1-3' (you transposed while transferring
> from the stackoverflow site to your email). But yeah, when you truncate
> to a smaller number, you are comparing different values (17367 is less
> than 26861).
>
> >
> > Using 'g' instead of 'n' also produces the expected results, but I'm not
> > clear on what the difference is between 'g' and 'n'.
>
> -n is specified by POSIX as parsing integers according to the current
> locale's definition. -g is a GNU extension, which says to parse
> floating point numbers. Apparently, in the en_US.UTF-8 locale, parsing
> floating point stops at the first comma, while parsing integers does not:
>
> $ printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
>   | sort -t, -k1g --debug
> sort: using ‘en_US.UTF-8’ sorting rules
> sort: key 1 is numeric and spans multiple fields
> 1,69,55,14
> _
> __________
> 1,73,67,6
> _
> _________
> 2,68,61,7
> _
> _________
> 2,71,59,12
> _
> __________
>
> I don't know why libc chose to make strtoll() ignore commas while
> strtold() does not, when not in the C locale.
>
> But at any rate, I hope I've demonstrated that the bug was in your usage
> and not in sort. So I'm closing this bug, although you should feel free
> to add further comments or questions.
> You may also want to read the FAQ:
>
> https://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021
> [Hmm - we should update that FAQ to mention the --debug option]
>
> --
> Eric Blake eblake redhat com +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
>
>
>
> ---------- Forwarded message ----------
> From: Ben Mendis <dragonwisard <at> gmail.com>
> To: bug-coreutils <at> gnu.org
> Cc:
> Date: Tue, 11 Nov 2014 11:39:12 -0500
> Subject: Possible bug in sort
>
> http://stackoverflow.com/questions/26869717/why-does-sort-seem-to-sort-a-field-incorrectly-based-on-the-presence-or-absenc
>
> Data is here: https://gist.github.com/anonymous/2a7beb4871b25ae8f8b3
>
> This results in line 7 being sorted incorrectly: sort -t , -k 1n < weird.csv
>
> This produced the expected results: cut -f , -d 1-3 < weird.csv | sort -t , -k 1n
>
> Using 'g' instead of 'n' also produces the expected results, but I'm not
> clear on what the difference is between 'g' and 'n'.
>
> Tested with sort 8.21 on Slackware64-current.
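
For readers skimming the thread, the two fixes suggested in the reply can be put side by side in a short script. This is only a sketch: the variable names (data, by_field, by_locale) are made up for illustration, and the four rows are the reduced sample from the reply rather than the original weird.csv.

```shell
# Reduced sample from the thread; "data" is an illustrative name.
data='1,73,67,6
2,68,61,7
1,69,55,14
2,71,59,12'

# Fix 1: end the numeric key at field 1 (-k1,1n) so the parse cannot
# run past the first comma, whatever the locale treats as a digit group.
by_field=$(printf '%s\n' "$data" | sort -t, -k1,1n)

# Fix 2: keep -k1n but switch to the C locale, where ',' is not part
# of a number, so the numeric parse stops at the first comma anyway.
by_locale=$(printf '%s\n' "$data" | LC_ALL=C sort -t, -k1n)

printf 'with -k1,1n:\n%s\n\n' "$by_field"
printf 'with LC_ALL=C and -k1n:\n%s\n' "$by_locale"
```

Either way, the rows whose first field is 1 should come out ahead of the rows whose first field is 2. Limiting the key with -k1,1n is probably the more portable habit, since it does not depend on which locale the script happens to run under.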