GNU bug report logs - #7878
"sort" bug--inconsistent single-column sorting influenced by other columns?


Package: coreutils

Reported by: Randall Lewis <ralewis <at> yahoo-inc.com>

Date: Fri, 21 Jan 2011 02:36:02 UTC

Severity: normal

Done: Bob Proulx <bob <at> proulx.com>

Bug is archived. No further changes may be made.

From: Bob Proulx <bob <at> proulx.com>
To: Randall Lewis <ralewis <at> yahoo-inc.com>
Cc: 7878 <at> debbugs.gnu.org
Subject: bug#7878: "sort" bug--inconsistent single-column sorting influenced by other columns?
Date: Thu, 20 Jan 2011 23:02:11 -0700

Randall Lewis wrote:
> "sort" does inconsistent sorting.

You are sure about that?  :-)

> I'm pretty sure it has NOTHING to do with the following warning,
> although I could be totally wrong.
> 
> " *** WARNING ***
> The locale specified by the environment affects sort order.
> Set LC_ALL=C to get the traditional sort order that uses
> native byte values. "

You read this, know that sort will base the sorting upon the locale
setting, but didn't tell us what locale you were using to sort?  Shame
on you.  Because you *know* I am going to ask you about it! :-)

What locale are you using?  C?  en_US.UTF-8?  Some other?  The locale
command will print this information.  Here is an example from my system.

  $ locale
  LANG=en_US.UTF-8
  LC_CTYPE="en_US.UTF-8"
  LC_NUMERIC="en_US.UTF-8"
  LC_TIME="en_US.UTF-8"
  LC_COLLATE=C
  LC_MONETARY="en_US.UTF-8"
  LC_MESSAGES="en_US.UTF-8"
  LC_PAPER="en_US.UTF-8"
  LC_NAME="en_US.UTF-8"
  LC_ADDRESS="en_US.UTF-8"
  LC_TELEPHONE="en_US.UTF-8"
  LC_MEASUREMENT="en_US.UTF-8"
  LC_IDENTIFICATION="en_US.UTF-8"
  LC_ALL=

> sort test1.txt
> 323|1
> 36|2
> 40|4
> 406|3
> 587|5

> sort test7.txt
> 323|B1
> 36|C2
> 406|B3
> 40|B4
> 587|C5

Looks okay to me for the en_US.UTF-8 locale.  But it will of course be
different in the C locale.

  $ LC_ALL=en_US.UTF-8 sort test1.txt 
  323|1
  36|2
  40|4
  406|3
  587|5

  $ LC_ALL=C sort test1.txt 
  323|1
  36|2
  406|3
  40|4
  587|5

What ordering did you expect there?  I assume you are expecting to see
these sorted as in the C locale?

> The rows are in a different order depending on the dataset--and it
> is NOT a numeric sort. I'm not even sure it is ANY type of sort.

It is a character sort.  A string sort.  It compares each line of
characters from start to finish, but it uses the system's collation
tables based upon the locale.  In the en_US.UTF-8 locale punctuation
is ignored and case is folded, so "40|4" is compared as "404" and
"406|3" as "4063".  That is why 40|4 sorts first in test1.txt while
406|B3 sorts first in test7.txt.  I don't like it but the powers that
be have decreed it.
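
As a small illustration (a sketch added for clarity, not part of the
original exchange), the C locale result can be explained by looking at
raw byte values.  It assumes an ASCII-compatible character set and a
POSIX shell; the variable names are made up for the example.

```shell
# A leading quote makes printf print the numeric code of the character.
byte_values=$(printf '%d %d' "'6" "'|")
printf '%s\n' "$byte_values"

# In the C locale sort compares bytes.  "406|3" and "40|4" first differ
# at the third character, where '6' has a lower byte value than '|',
# so 406|3 is placed first.
c_sorted=$(printf '%s\n' '40|4' '406|3' | LC_ALL=C sort)
printf '%s\n' "$c_sorted"
```

In a locale that ignores punctuation the same two lines compare as
"404" and "4063", which reverses the order.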

Please see the FAQ:

  http://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021

The standards documentation:

  http://www.opengroup.org/onlinepubs/009695399/utilities/sort.html

Variables that control localization:

  http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap08.html#tag_08_02

> sort -k1 -t "|" test1.txt

Hint: If you ever think you need to use -k POS1 then you almost always
should be using -k POS1,POS2 to specify where you want the sort to
stop comparing.  Otherwise it compares all of the way to the end of
the line.
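
A minimal sketch of the difference, using the test1.txt data from the
report recreated in a temporary directory (the directory and variable
names are only for this example).  LC_ALL=C is used so the result does
not depend on which locales are installed.

```shell
dir=$(mktemp -d)
printf '%s\n' '323|1' '36|2' '40|4' '406|3' '587|5' > "$dir/test1.txt"

# -k1: the key starts at field 1 and runs to the end of the line,
# so the '|' separator and the second column take part in the comparison.
whole_line=$(LC_ALL=C sort -t'|' -k1 "$dir/test1.txt")

# -k1,1: the key starts and stops at field 1.
first_field=$(LC_ALL=C sort -t'|' -k1,1 "$dir/test1.txt")

printf '%s\n\n%s\n' "$whole_line" "$first_field"
rm -r "$dir"
```

With -k1 the line 406|3 comes before 40|4; with -k1,1 the key "40" is
a prefix of "406" and so 40|4 comes first, regardless of what is in
the second column.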

> But why did it sort inconsistently in the first place based on the
> other contents of the file rather than just focusing on the first
> column--even when I told it to?

You never told it to stop at the end of the first field, so it kept
comparing all of the way to the end of the line.  To limit each key to
a single field, here with a numeric comparison, do it this way:

  $ sort -t'|' -k1,1n -k2,2n test1.txt 
  36|2
  40|4
  323|1
  406|3
  587|5

A numeric sort won't help you with join, though, since join expects
its input in a non-numeric sort ordering on the join field.

> Inconsistent sorting when combined with 'join' provides incorrect
> matches and duplication of records. This is a mess.

Yes.  Recent versions of join detect and warn about this.  Recent
versions of sort have a --debug option that can help to identify
problem cases.
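
One way to prepare the two files for join, sketched here under the
assumption that byte ordering (LC_ALL=C) is acceptable for the data:
sort both inputs on the join field only, and run join in the same
locale.  The temporary directory and the a.sorted/b.sorted file names
are made up for the example.

```shell
dir=$(mktemp -d)
printf '%s\n' '323|1' '36|2' '40|4' '406|3' '587|5' > "$dir/test1.txt"
printf '%s\n' '323|B1' '36|C2' '406|B3' '40|B4' '587|C5' > "$dir/test7.txt"

# Sort both inputs on field 1 alone, in the locale join will also use.
LC_ALL=C sort -t'|' -k1,1 "$dir/test1.txt" > "$dir/a.sorted"
LC_ALL=C sort -t'|' -k1,1 "$dir/test7.txt" > "$dir/b.sorted"

# Join on the first field; both files now agree on the key order.
joined=$(LC_ALL=C join -t'|' "$dir/a.sorted" "$dir/b.sorted")
printf '%s\n' "$joined"
rm -r "$dir"
```

Because the key stops at the first field, both files end up in the
same order on that field and every key is matched exactly once.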

Bob




This bug report was last modified 14 years and 126 days ago.
