GNU bug report logs - #6007
locale sort ordering confusion


Package: coreutils;

Reported by: "Vito Di Blas" <vito.diblas <at> libero.it>

Date: Thu, 22 Apr 2010 21:45:03 UTC

Severity: normal

Tags: moreinfo

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.


From: Bob Proulx <bob <at> proulx.com>
To: Vito Di Blas <vito.diblas <at> libero.it>
Cc: 6007 <at> debbugs.gnu.org
Subject: bug#6007: sort command in Fedora10
Date: Thu, 22 Apr 2010 16:41:39 -0600
tags 6007 + moreinfo
retitle 6007 locale sort ordering confusion
thanks

Vito Di Blas wrote:
> <...>    sort  < aaa.txt  >  bbb.txt
> Cari figli, domani
> Cari figli, ieri
> Cari figli, oggi
> Cari figlioli
> Cari figliozzi
> Cari figli, pregate
> Cari figlipucci

Thank you for the bug report.  However, what you are seeing is intended
behavior and isn't something sort itself controls.  The character
collation sequence is chosen by your specified locale.  You can see
what locale you have configured with the 'locale' command.

  $ locale

> which doesn't look sorted according to my expectation.

You don't like it and I don't like it, but the powers that be have
confused working with data on a computer with talking about working
with data on a computer.  They have decided that the collation
ordering (sort ordering) for data should be dictionary ordering.  In
dictionary ordering, case is folded together and punctuation is
ignored.  That is why "Cari figli, pregate" lands between
"Cari figliozzi" and "Cari figlipucci" in your output: with the comma
and space ignored, it compares as "Carifiglipregate".  Having LANG set
to any of the "en" locales instructs the system to use dictionary sort
ordering.  This affects almost everything on the system that sorts,
including commands such as 'ls' and your shell's filename expansion
(e.g. 'echo *').
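
To see the difference for yourself, something like the following
sketch may help.  It assumes a POSIX shell and a sort that honors
LC_ALL; the function name sort_bytewise is made up for this example
and is not part of coreutils.  It forces the C locale for one sort
invocation only, so lines are compared byte by byte and the comma and
space are no longer ignored.

```shell
# Hypothetical helper (not part of coreutils): run sort with the C locale
# for this one invocation only, leaving the rest of the environment alone.
sort_bytewise() {
    LC_ALL=C sort "$@"
}

# A few of the reporter's sample lines, sorted by byte value.
printf '%s\n' \
    'Cari figlioli' \
    'Cari figli, pregate' \
    'Cari figlipucci' \
    'Cari figli, domani' |
    sort_bytewise
```

In the C locale the comma (byte 0x2C) compares lower than any letter,
so both "Cari figli, ..." lines should come out ahead of
"Cari figlioli" and "Cari figlipucci".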

> Should  I  use in Fedora some sort option or I met a bug?

Your sort order depends upon your locale.  You didn't say what your
locale was and therefore I assume that you were not aware that it
had an effect.

The documentation says:

     Unless otherwise specified, all comparisons use the character
  collating sequence specified by the `LC_COLLATE' locale.(1)
  ...
     (1) If you use a non-POSIX locale (e.g., by setting `LC_ALL' to
  `en_US'), then `sort' may produce output that is sorted differently
  than you're accustomed to.  In that case, set the `LC_ALL'
  environment variable to `C'.  Note that setting only `LC_COLLATE'
  has two problems.  First, it is ineffective if `LC_ALL' is also set.
  Second, it has undefined behavior if `LC_CTYPE' (or `LANG', if
  `LC_CTYPE' is unset) is set to an incompatible value.  For example,
  you get undefined behavior if `LC_CTYPE' is `ja_JP.PCK' but
  `LC_COLLATE' is `en_US.UTF-8'.
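
The precedence described in that footnote can be sketched roughly as
follows.  This is only an illustration: the function name
effective_collate is hypothetical, empty values are treated as unset,
and falling back to "C" when nothing is set is an assumption (the real
default is up to the system).

```shell
# Hypothetical helper: print the locale name that should govern collation,
# following the precedence LC_ALL > LC_COLLATE > LANG.
# Empty values are treated as unset; "C" is an assumed final fallback.
effective_collate() {
    printf '%s\n' "${LC_ALL:-${LC_COLLATE:-${LANG:-C}}}"
}

effective_collate
```

The result should normally agree with the LC_COLLATE line printed by
the 'locale' command, which remains the authoritative way to check.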

Personally I have the following in my $HOME/.bashrc file.

  export LANG=en_US.UTF-8
  export LC_COLLATE=C

That sets most of my locale to a UTF-8 one but forces sorting to be
standard C/POSIX.  It has no effect if LC_ALL is also set, and I can't
promise it works in the general case since I have no idea how it
interacts with all character sets.
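
A quick way to check whether a setting like this took effect might
look like the sketch below.  It is a heuristic, not a definitive test:
the function name collation_is_bytewise is made up, and it only checks
whether an uppercase 'B' sorts before a lowercase 'a', which is what
byte-value (C/POSIX) ordering does.

```shell
# Hypothetical check: succeeds if sort in the current environment puts
# an uppercase 'B' before a lowercase 'a', as byte-value (C/POSIX)
# collation does.
collation_is_bytewise() {
    first=$(printf 'a\nB\n' | sort | head -n 1)
    [ "$first" = "B" ]
}

if collation_is_bytewise; then
    echo "sort appears to use byte-value ordering"
else
    echo "sort appears to use locale (dictionary) ordering"
fi
```

Run it in a fresh shell after editing $HOME/.bashrc so the new
environment is actually in use.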

You may want to look at the FAQ.

  http://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021

> Then, in WindowsXP, I sort again the file aaa.txt with the command:
> ...
> which looks sorted as expected.

Probably that platform does not support, or is not configured for, the
same locale sets as the other host.

Bob




This bug report was last modified 15 years and 33 days ago.
