Package: coreutils
Reported by: Randall Lewis <ralewis <at> yahoo-inc.com>
Date: Fri, 21 Jan 2011 02:36:02 UTC
Severity: normal
Done: Bob Proulx <bob <at> proulx.com>
Bug is archived. No further changes may be made.
From: Randall Lewis <ralewis <at> yahoo-inc.com>
To: Bob Proulx <bob <at> proulx.com>
Cc: "7878 <at> debbugs.gnu.org" <7878 <at> debbugs.gnu.org>
Subject: bug#7878: "sort" bug--inconsistent single-column sorting influenced by other columns?
Date: Thu, 20 Jan 2011 23:29:42 -0800

Hi Bob--

Wow! So, a couple comments about how I seem to have figured out every
wrong way to use "sort" when also using "join."

Who would've thought that

  sort -k1 test1.txt

would default to sort on the entire line? (I normally would've thought
that [,POS2] means "optional if you want to have it keep going beyond
the first field.")

Also, who would've thought that the default "sort" would be
incompatible with "join" and that you would need to write the command
like this every time you wanted to use "join"?

  LC_ALL=C sort test1.txt

Or that you would need a special type of "pre-sort" on the column
(which I was executing wrong)?

  sort -k1,1 -t "|" test1.txt

Regardless, here is "locale" (for the record, I'm pretty new to the
utilities--and love them. I'm not a computer scientist, but rather an
economist trying to fit in at Yahoo! with the engineers and computer
scientists). I'm sure there's a good reason why there are two, and
it's pretty clear that I'm novice enough that I'll have to learn that
later.

  bash-3.2$ locale
  LANG=en_US.UTF-8
  LC_CTYPE="en_US.UTF-8"
  LC_NUMERIC="en_US.UTF-8"
  LC_TIME="en_US.UTF-8"
  LC_COLLATE="en_US.UTF-8"
  LC_MONETARY="en_US.UTF-8"
  LC_MESSAGES="en_US.UTF-8"
  LC_PAPER="en_US.UTF-8"
  LC_NAME="en_US.UTF-8"
  LC_ADDRESS="en_US.UTF-8"
  LC_TELEPHONE="en_US.UTF-8"
  LC_MEASUREMENT="en_US.UTF-8"
  LC_IDENTIFICATION="en_US.UTF-8"
  LC_ALL=

Thanks, Bob, for sharing two separate ways that I could get the answer
the way I need it--two ways I could not have come up with on my own.

Thanks!
--Randall

P.S. So, the reason why sorting on the column didn't work for me was
because it was plucking out the delimiter and then doing a string
sort? Then it was string sorting, putting numbers before letters (as
you might expect it to)?

  bash-3.2$ sort test1.txt
  323|1
  36|2
  406|3
  40|7   <-- Changing from 4 to 7 changed the sort order.
  587|5
  bash-3.2$ sort test1.txt
  323|1
  36|2
  40|4
  406|3
  587|5

-----Original Message-----
From: Bob Proulx [mailto:bob <at> proulx.com]
Sent: Thursday, January 20, 2011 10:02 PM
To: Randall Lewis
Cc: 7878 <at> debbugs.gnu.org
Subject: Re: bug#7878: "sort" bug--inconsistent single-column sorting influenced by other columns?

Randall Lewis wrote:
> "sort" does inconsistent sorting.

You are sure about that? :-)

> I'm pretty sure it has NOTHING to do with the following warning,
> although I could be totally wrong.
>
> " *** WARNING ***
> The locale specified by the environment affects sort order.
> Set LC_ALL=C to get the traditional sort order that uses
> native byte values. "

You read this, know that sort will base the sorting upon the locale
setting, but didn't tell us what locale you were using to sort? Shame
on you. Because you *know* I am going to ask you about it! :-)

What locale are you using? C? en_US.UTF-8? Some other? The locale
command will print this information. Here is an example from my
system.

  $ locale
  LANG=en_US.UTF-8
  LC_CTYPE="en_US.UTF-8"
  LC_NUMERIC="en_US.UTF-8"
  LC_TIME="en_US.UTF-8"
  LC_COLLATE=C
  LC_MONETARY="en_US.UTF-8"
  LC_MESSAGES="en_US.UTF-8"
  LC_PAPER="en_US.UTF-8"
  LC_NAME="en_US.UTF-8"
  LC_ADDRESS="en_US.UTF-8"
  LC_TELEPHONE="en_US.UTF-8"
  LC_MEASUREMENT="en_US.UTF-8"
  LC_IDENTIFICATION="en_US.UTF-8"
  LC_ALL=

> sort test1.txt
> 323|1
> 36|2
> 40|4
> 406|3
> 587|5
> sort test7.txt
> 323|B1
> 36|C2
> 406|B3
> 40|B4
> 587|C5

Looks okay to me for the en_US.UTF-8 locale. But it will of course be
different in the C locale.

  $ LC_ALL=en_US.UTF-8 sort test1.txt
  323|1
  36|2
  40|4
  406|3
  587|5
  $ LC_ALL=C sort test1.txt
  323|1
  36|2
  406|3
  40|4
  587|5

What ordering did you expect there? I assume you are expecting to see
these sorted as in the C locale?

> The rows are in a different order depending on the dataset--and it
> is NOT a numeric sort. I'm not even sure it is ANY type of sort.

It is a character sort. A string sort.
It is comparing the line of characters from start to finish. But it
uses the system's collation tables based upon the locale. In the
en_US.UTF-8 locale punctuation is ignored and case is folded. I don't
like it but the powers that be have decreed it.

Please see the FAQ:

  http://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021

The standards documentation:

  http://www.opengroup.org/onlinepubs/009695399/utilities/sort.html

Variables that control localization:

  http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap08.html#tag_08_02

> sort -k1 -t "|" test1.txt

Hint: If you ever think you need to use -k POS1 then you almost always
should be using -k POS1,POS2 to specify where you want the sort to
stop comparing. Otherwise it compares all of the way to the end of
the line.

> But why did it sort inconsistently in the first place based on the
> other contents of the file rather than just focusing on the first
> column--even when I told it to?

You never told it not to continue comparing all of the way to the end
of the line. For example this way:

  $ sort -t'|' -k1,1n -k2,2n test1.txt
  36|2
  40|4
  323|1
  406|3
  587|5

That won't help you with join since that expects a non-numeric sort
ordering.

> Inconsistent sorting when combined with 'join' provides incorrect
> matches and duplication of records. This is a mess.

Yes. Recent versions of join detect and warn about this. Recent
versions of sort have a --debug option that can help to identify
problem cases.

Bob
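
The invocations discussed in this thread can be tried side by side with
a small shell sketch like the one below. It is an illustration only: it
assumes a POSIX shell with GNU sort and join on PATH, and the scratch
directory and the second file (other.txt) are made up for the example
and are not part of the original report. Everything runs in the C
locale, so the results do not depend on which locales are installed.

```shell
# Scratch directory and sample data (test1.txt as quoted in the thread;
# other.txt is a hypothetical second file to join against).
tmp=$(mktemp -d)
printf '%s\n' '323|1' '36|2' '40|4' '406|3' '587|5' > "$tmp/test1.txt"
printf '%s\n' '406|z' '36|x' '40|y' > "$tmp/other.txt"

# 1. Whole-line comparison in byte order. '|' (0x7C) is greater than
#    '6' (0x36), so "406|3" sorts before "40|4".
whole_line=$(LC_ALL=C sort "$tmp/test1.txt")

# 2. Key limited to field 1 (-k1,1). The keys "40" and "406" are compared
#    on their own, and the shorter prefix "40" sorts first.
first_field=$(LC_ALL=C sort -t'|' -k1,1 "$tmp/test1.txt")

# 3. Numeric sort on field 1, then field 2. Readable for humans, but not
#    the ordering that join expects.
numeric=$(LC_ALL=C sort -t'|' -k1,1n -k2,2n "$tmp/test1.txt")

# 4. Sort both inputs on the join field with the same locale, then join.
LC_ALL=C sort -t'|' -k1,1 "$tmp/test1.txt" > "$tmp/a.sorted"
LC_ALL=C sort -t'|' -k1,1 "$tmp/other.txt" > "$tmp/b.sorted"
joined=$(LC_ALL=C join -t'|' "$tmp/a.sorted" "$tmp/b.sorted")

printf 'whole line:\n%s\n\n' "$whole_line"    # 323|1 36|2 406|3 40|4 587|5
printf 'field 1 only:\n%s\n\n' "$first_field" # 323|1 36|2 40|4 406|3 587|5
printf 'numeric:\n%s\n\n' "$numeric"          # 36|2 40|4 323|1 406|3 587|5
printf 'joined:\n%s\n' "$joined"              # 36|2|x 40|4|y 406|3|z

rm -rf "$tmp"
```

The point of step 4 is that sort and join are given the same locale and
the same field delimiter, so both tools agree on what "sorted on the
first field" means.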