Why not have used sort -t ',' -k 1n ?

Regards
Leslie

Mr. Leslie Satenstein
Montréal Québec, Canada

From: Eric Blake
To: Ben Mendis; 19021-done@debbugs.gnu.org
Sent: Tuesday, November 11, 2014 12:39 PM
Subject: bug#19021: Possible bug in sort

tag 19021 notabug
thanks

On 11/11/2014 09:39 AM, Ben Mendis wrote:
> http://stackoverflow.com/questions/26869717/why-does-sort-seem-to-sort-a-field-incorrectly-based-on-the-presence-or-absenc
>
> Data is here: https://gist.github.com/anonymous/2a7beb4871b25ae8f8b3

Thanks for the report. Rather than making us chase down links, why not
provide the information inline with your email?

>
> This results in line 7 being sorted incorrectly: sort -t , -k 1n < weird.csv

Try using the --debug option to see what is really happening. The bug
is NOT in sort (which correctly obeyed your locale rules and incorrect
command line), but in your command line (because you didn't tell sort
where to quit parsing numbers).

I'm going to distill it down to a smaller input that still expresses the
same "swapped" lines:

$ printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
  | sort -t, -k1n --debug
sort: using ‘en_US.UTF-8’ sorting rules
sort: key 1 is numeric and spans multiple fields
1,73,67,6
_________
_________
2,68,61,7
_________
_________
1,69,55,14
__________
__________
2,71,59,12
__________
__________

See what's happening? The -k1n argument says to start parsing at field
1, but continue parsing until either the input is no longer numeric or
until the end of line is reached (even if it goes into field 2 or
beyond). Since commas are silently ignored in the en_US.UTF-8 locale
when parsing a number, sort is thus comparing the values 268617 and
1695514, and the sort was correct.
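
To make that comparison concrete, here is a minimal sketch. It assumes
that deleting the commas by hand with tr is a fair approximation of how
-n skips them as grouping characters in en_US.UTF-8; it is not what sort
does internally. The helper name comma_blind_sort is hypothetical, and
LC_ALL=C is forced so the result does not depend on which locales are
installed.

```shell
# Hypothetical helper, not part of coreutils: delete every comma so each
# line collapses into the single integer that -k1n would effectively see
# in en_US.UTF-8, then sort those integers numerically in the C locale.
comma_blind_sort() {
  tr -d ',' | LC_ALL=C sort -n
}

# Same four sample lines as in the --debug run above.
printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
  | comma_blind_sort
```

This should print 173676, 268617, 1695514 and 2715912, one per line,
which is the same line order that the --debug run shows for -k1n.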

Now, try telling sort that it must parse a numeric field, but must END
the parse at the end of the first field (if not sooner due to end of
number):

$ printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
  | sort -t, -k1,1n --debug
sort: using ‘en_US.UTF-8’ sorting rules
1,69,55,14
_
__________
1,73,67,6
_
_________
2,68,61,7
_
_________
2,71,59,12
_
__________

Or try using a locale where ',' is NOT part of a valid number:

$ printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
  | LC_ALL=C sort -t, -k1n --debug
sort: using simple byte comparison
sort: key 1 is numeric and spans multiple fields
1,69,55,14
_
__________
1,73,67,6
_
_________
2,68,61,7
_
_________
2,71,59,12
_
__________

>
> This produced the expected results: cut -f , -d 1-3 < weird.csv | sort -t ,
> -k 1n

Actually, you mean 'cut -d, -f 1-3' (you transposed while transferring
from the stackoverflow site to your email). But yeah, when you truncate
to a smaller number, you are comparing different values (17367 is less
than 26861).

>
> Using 'g' instead of 'n' also produces the expected results, but I'm not
> clear on what the difference is between 'g' and 'n'.

-n is specified by POSIX as parsing integers according to the current
locale's definition. -g is a GNU extension, which says to parse
floating point numbers. Apparently, in the en_US.UTF-8 locale, parsing
floating point stops at the first comma, while parsing integers does not:

$ printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
  | sort -t, -k1g --debug
sort: using ‘en_US.UTF-8’ sorting rules
sort: key 1 is numeric and spans multiple fields
1,69,55,14
_
__________
1,73,67,6
_
_________
2,68,61,7
_
_________
2,71,59,12
_
__________

I don't know why libc chose to make strtoll() ignore commas while
strtold() does not, when not in the C locale.

But at any rate, I hope I've demonstrated that the bug was in your usage
and not in sort. So I'm closing this bug, although you should feel free
to add further comments or questions.

You may also want to read the FAQ:
https://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021

[Hmm - we should update that FAQ to mention the --debug option]

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org