GNU bug report logs -
#6007
locale sort ordering confusion
Reported by: "Vito Di Blas" <vito.diblas <at> libero.it>
Date: Thu, 22 Apr 2010 21:45:03 UTC
Severity: normal
Tags: moreinfo
Done: Eric Blake <eblake <at> redhat.com>
Bug is archived. No further changes may be made.
Report forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org: bug#6007; Package coreutils. (Thu, 22 Apr 2010 21:45:03 GMT)
Acknowledgement sent to "Vito Di Blas" <vito.diblas <at> libero.it>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Thu, 22 Apr 2010 21:45:03 GMT)
Message #5 received at submit <at> debbugs.gnu.org:
Dear Friends, in Linux Fedora10, I sort the file aaa.txt :
Cari figliozzi
Cari figlipucci
Cari figli, oggi
Cari figli, ieri
Cari figli, domani
Cari figli, pregate
Cari figlioli
with the command:
<...> sort < aaa.txt > bbb.txt
and I obtain the file bbb.txt
Cari figli, domani
Cari figli, ieri
Cari figli, oggi
Cari figlioli
Cari figliozzi
Cari figli, pregate
Cari figlipucci
which doesn't look sorted according to my expectation.
Then, in WindowsXP, I sort the file aaa.txt again with the command:
<...> sort aaa.txt > ccc.txt
and I get the file ccc.txt :
Cari figli, domani
Cari figli, ieri
Cari figli, oggi
Cari figli, pregate
Cari figlioli
Cari figliozzi
Cari figlipucci
which looks sorted as expected.
Should I use some sort option in Fedora, or have I met a bug?
Thanks for your attention and best regards
Vito Di Blas Ivrea Italy
vito.diblas <at> libero.it
Reply sent to Eric Blake <eblake <at> redhat.com>: You have taken responsibility. (Thu, 22 Apr 2010 22:29:02 GMT)
Notification sent to "Vito Di Blas" <vito.diblas <at> libero.it>: bug acknowledged by developer. (Thu, 22 Apr 2010 22:29:02 GMT)
Message #10 received at 6007-done <at> debbugs.gnu.org:
On 04/22/2010 03:34 PM, Vito Di Blas wrote:
> and I obtain the file bbb.txt
>
> Cari figli, domani
> Cari figli, ieri
> Cari figli, oggi
> Cari figlioli
> Cari figliozzi
> Cari figli, pregate
> Cari figlipucci
>
>
> which doesn't look sorted according to my expectation.
Not a bug, if you are in a locale where the collating order discards
punctuation and whitespace as insignificant.
> Then, in WindowsXP, I sort the file aaa.txt again with the command:
>
> <...> sort aaa.txt > ccc.txt
>
> and I get the file ccc.txt :
>
> Cari figli, domani
> Cari figli, ieri
> Cari figli, oggi
> Cari figli, pregate
> Cari figlioli
> Cari figliozzi
> Cari figlipucci
This is due to a difference in the default locales of your two systems.
http://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021
Try again with 'LC_ALL=C sort aaa.txt' to see the difference.
Personally, I have 'export LC_COLLATE=C' in my ~/.bashrc in order to
guarantee traditional sorting, while everything else continues to follow
my default locale.
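
The two outputs in this report can be approximated with a short Python sketch. It is only an illustration, not how sort or glibc actually collate: the helper dictionary_like_key is a hypothetical name, and it simply drops everything that is not a letter or digit and folds case, whereas real locale collation uses multi-level weights. Plain sorted() compares code points, which for this ASCII data matches what LC_ALL=C gives.

```python
# Input lines from the report (aaa.txt).
lines = [
    "Cari figliozzi",
    "Cari figlipucci",
    "Cari figli, oggi",
    "Cari figli, ieri",
    "Cari figli, domani",
    "Cari figli, pregate",
    "Cari figlioli",
]


def dictionary_like_key(text):
    # Hypothetical helper: ignore punctuation and whitespace, fold case.
    # This is a rough stand-in for a locale whose collation treats
    # those characters as insignificant.
    return "".join(ch for ch in text if ch.isalnum()).lower()


# Code point order: "," (0x2C) sorts before every letter.
c_order = sorted(lines)

# Punctuation-insensitive order: "figli, pregate" is compared
# as if it were "figlipregate".
dict_order = sorted(lines, key=dictionary_like_key)

print("\n".join(c_order))
print()
print("\n".join(dict_order))
```

The first list comes out in the order of ccc.txt and the second in the order of bbb.txt, which is consistent with the explanation that only the collation rules differ between the two systems.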
--
Eric Blake eblake <at> redhat.com +1-801-349-2682
Libvirt virtualization library http://libvirt.org
Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org: bug#6007; Package coreutils. (Thu, 22 Apr 2010 22:42:02 GMT)
Message #13 received at 6007 <at> debbugs.gnu.org:
tags 6007 + moreinfo
retitle 6007 locale sort ordering confusion
thanks
Vito Di Blas wrote:
> <...> sort < aaa.txt > bbb.txt
> Cari figli, domani
> Cari figli, ieri
> Cari figli, oggi
> Cari figlioli
> Cari figliozzi
> Cari figli, pregate
> Cari figlipucci
Thank you for the bug report. However, what you are seeing is intended
behavior. It isn't something sort has control over. The character
collation sequence is chosen by your specified locale. You can see
what locale you have configured with the 'locale' command.
$ locale
> which doesn't look sorted according to my expectation.
You don't like it and I don't like it but the-powers-that-be have
confused working with data on a computer with talking about working
with data on a computer. They have decided that the collation
ordering (sort ordering) for data should be dictionary ordering. In
dictionary ordering, case is folded together and punctuation is
ignored. Setting LANG to any of the "en" locales instructs the system
to use dictionary sort ordering. This affects almost everything on
the system that sorts, including commands such as 'ls' and your
shell's glob expansion (e.g. 'echo *').
> Should I use some sort option in Fedora, or have I met a bug?
Your sort order depends upon your locale. You didn't say what your
locale was and therefore I assume that you were not aware that it
had an effect.
The documentation says:
Unless otherwise specified, all comparisons use the character
collating sequence specified by the `LC_COLLATE' locale.(1)
...
(1) If you use a non-POSIX locale (e.g., by setting `LC_ALL' to
`en_US'), then `sort' may produce output that is sorted differently
than you're accustomed to. In that case, set the `LC_ALL'
environment variable to `C'. Note that setting only `LC_COLLATE'
has two problems. First, it is ineffective if `LC_ALL' is also set.
Second, it has undefined behavior if `LC_CTYPE' (or `LANG', if
`LC_CTYPE' is unset) is set to an incompatible value. For example,
you get undefined behavior if `LC_CTYPE' is `ja_JP.PCK' but
`LC_COLLATE' is `en_US.UTF-8'.
Personally I have the following in my $HOME/.bashrc file.
export LANG=en_US.UTF-8
export LC_COLLATE=C
That sets most of my locale to a UTF-8 one but forces sorting to be
standard C/POSIX. This probably won't work in the general case since
I have no idea how that would interact with all character sets.
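
The precedence described in the documentation footnote, and the effect of the two exports, can be sketched in Python. This is a simplified model under stated assumptions: effective_collation_locale is a hypothetical helper, empty values are treated as unset, and the final fallback to "C" is an assumption that matches glibc rather than something the quoted documentation states.

```python
def effective_collation_locale(env):
    """Return the locale name that governs collation for a given environment.

    Precedence: LC_ALL, then LC_COLLATE, then LANG.
    Empty values count as unset. The "C" fallback is an assumption.
    """
    for name in ("LC_ALL", "LC_COLLATE", "LANG"):
        value = env.get(name, "")
        if value:
            return value
    return "C"


# The two exports from the .bashrc above.
settings = {"LANG": "en_US.UTF-8", "LC_COLLATE": "C"}
print(effective_collation_locale(settings))

# If LC_ALL is also set, LC_COLLATE no longer has any effect.
overridden = dict(settings, LC_ALL="en_US.UTF-8")
print(effective_collation_locale(overridden))
```

With only LANG and LC_COLLATE set, collation resolves to C while the other categories still follow LANG. Adding LC_ALL makes the LC_COLLATE setting ineffective, which is the first problem the footnote warns about.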
You may want to look at the FAQ.
http://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021
> Then, in WindowsXP, I sort the file aaa.txt again with the command:
> ...
> which looks sorted as expected.
Probably that platform does not support, or is not configured for, the
same locale sets as the other host.
Bob
Added tag(s) moreinfo. Request was from Bob Proulx <bob <at> proulx.com> to control <at> debbugs.gnu.org. (Thu, 22 Apr 2010 22:42:02 GMT)
Changed bug title to 'locale sort ordering confusion' from 'sort command in Fedora10'. Request was from Bob Proulx <bob <at> proulx.com> to control <at> debbugs.gnu.org. (Thu, 22 Apr 2010 22:42:02 GMT)
Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org: bug#6007; Package coreutils. (Thu, 22 Apr 2010 23:09:02 GMT)
Message #20 received at 6007 <at> debbugs.gnu.org:
Bob Proulx writes:
>
> You don't like it and I don't like it but the-powers-that-be have
Who's the "power" here anyway? Who do we have to impeach? Seriously. The
"en_US" locale is an unmitigated disaster. It's officially called "not a bug"
every time it comes up, which seems to be once a week on this list alone, so
what volume of complaints is required to tip the balance to "all right it's a
damn bug let's fix it"?
From the name "en_US" one might guess that it represents the behavior
expected by English-speaking users in or from the US. But those users have
lived with computers for a generation or two. What they expect is
ASCIIbetical. The only people who actually expect phone-book-style sorting
are old geezers who remember what a phone book was. Most of them have never
used a computer and never will, so why do we (and by "we" I mean whoever
makes the locale rules) bend the default to accommodate them?
--
Alan Curry
Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org: bug#6007; Package coreutils. (Thu, 22 Apr 2010 23:43:02 GMT)
Message #23 received at 6007 <at> debbugs.gnu.org:
Alan Curry wrote:
> Bob Proulx writes:
> > You don't like it and I don't like it but the-powers-that-be have
>
> Who's the "power" here anyway? Who do we have to impeach? Seriously. The
> "en_US" locale is an unmitigated disaster. It's officially called "not a bug"
> every time it comes up, which seems to be once a week on this list alone, so
> what volume of complaints is required to tip the balance to "all right it's a
> damn bug let's fix it"?
As far as I know, which isn't as much as I would like especially in
this case, it is implemented in libc. Therefore it would need to be
addressed with libc folks.
http://www.gnu.org/software/libc/
But very likely the chain continues well beyond that point. If you
find out, please educate me.
> From the name "en_US" one might guess that it represents the behavior
> expected by English-speaking users in or from the US. But those users have
> lived with computers for a generation or two. What they expect is
> ASCIIbetical. The only people who actually expect phone-book-style sorting
> are old geezers who remember what a phone book was. Most of them have never
> used a computer and never will, so why do we (and by "we" I mean whoever
> makes the locale rules) bend the default to accommodate them?
It would be nice to be able to set my locale to en_US@C.UTF-8 or
en_US@POSIX.UTF-8 and get a better behaved collation sequence.
Bob
Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org: bug#6007; Package coreutils. (Fri, 23 Apr 2010 08:17:02 GMT)
Message #26 received at 6007 <at> debbugs.gnu.org:
"Alan Curry" <pacman-cu <at> kosh.dhis.org> writes:
> Who's the "power" here anyway?
You are, actually. Everyone can define locales to behave the way he
likes, see localedef(1).
> From the name "en_US" one might guess that it represents the behavior
> expected by English-speaking users in or from the US. But those users
> have lived with computers for a generation or two. What they expect is
> ASCIIbetical.
Nowadays most people don't know what ASCII is.
Andreas.
--
Andreas Schwab, schwab <at> linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."
Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org: bug#6007; Package coreutils. (Fri, 23 Apr 2010 08:48:02 GMT)
Message #29 received at 6007 <at> debbugs.gnu.org:
Andreas Schwab writes:
>
> "Alan Curry" <pacman-cu <at> kosh.dhis.org> writes:
>
> > Who's the "power" here anyway?
>
> You are, actually. Everyone can define locales to behave the way he
> likes, see localedef(1).
I avoid this by not having any locales installed. But that doesn't help all
the other victims.
>
> > From the name "en_US" one might guess that it represents the behavior
> > expected by English-speaking users in or from the US. But those users
> > have lived with computers for a generation or two. What they expect is
> > ASCIIbetical.
>
> Nowadays most people don't know what ASCII is.
They may not know how to name it, but they do complain when it isn't used,
enough that it's a FAQ.
People install a GNU/Linux distribution, pick "English" from the language
menu, and get a set of sorting rules that doesn't make sense. Sorry, should
have told the installer you speak "C".
"Donna Summer" just doesn't belong between "Don Adams" and "Don Pardo", and
everyone knows it. Not a bug? Bah. Not a coreutils bug, but it's a bug. If
glibc was in the same bug tracking system with coreutils, reports like this
one could be reassigned there.
--
Alan Curry
Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org: bug#6007; Package coreutils. (Fri, 23 Apr 2010 19:08:02 GMT)
Message #32 received at 6007 <at> debbugs.gnu.org:
Andreas Schwab wrote:
> Alan Curry writes:
> > From the name "en_US" one might guess that it represents the behavior
> > expected by English-speaking users in or from the US. But those users
> > have lived with computers for a generation or two. What they expect is
> > ASCIIbetical.
>
> Nowadays most people don't know what ASCII is.
Even fewer know about EBCDIC. Or why native host byte ordering might
differ between machines with different encodings.
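
As a small illustration of that point, here is a Python sketch comparing byte order under ASCII and under one EBCDIC variant. The choice of the cp037 codec is an assumption made for the example; other EBCDIC code pages exist and may differ in detail.

```python
samples = ["a", "A", "1"]

# ASCII bytes: digits (0x30..) < uppercase (0x41..) < lowercase (0x61..).
ascii_order = sorted(samples, key=lambda s: s.encode("ascii"))

# EBCDIC (cp037) bytes: lowercase (0x81..) < uppercase (0xC1..) < digits (0xF0..).
ebcdic_order = sorted(samples, key=lambda s: s.encode("cp037"))

print(ascii_order)   # ['1', 'A', 'a']
print(ebcdic_order)  # ['a', 'A', '1']
```

So even a plain byte-order sort is only "traditional" relative to a particular encoding.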
Bob
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 22 May 2010 11:24:03 GMT)
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.