GNU bug report logs - #19142
sort not working with LANG set to language_country.encoding

Package: coreutils;

Reported by: Roland Sieker <ospalh <at> gmail.com>

Date: Fri, 21 Nov 2014 16:49:02 UTC

Severity: normal

Tags: notabug

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 19142 in the body.
You can then email your comments to 19142 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-coreutils <at> gnu.org:
bug#19142; Package coreutils. (Fri, 21 Nov 2014 16:49:02 GMT)

Acknowledgement sent to Roland Sieker <ospalh <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Fri, 21 Nov 2014 16:49:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Roland Sieker <ospalh <at> gmail.com>
To: bug-coreutils <at> gnu.org
Subject: sort not working with LANG set to language_country.encoding
Date: Fri, 21 Nov 2014 12:24:56 +0100
Hi.

I have noticed that sort seems to have problems when the LANG environment
variable is set with language and country.

As a test case, I tried to sort:

a
b
a
⺌
⺕
⺌

It sorts OK like this, with LANG just the language.encoding:
( setenv LANG en.UTF-8 ; echo 'a\nb\na\n⺌\n⺕\n⺌' | sort )
a
a
b
⺌
⺌
⺕

But not with LANG as language_country.encoding:
( setenv LANG en_GB.UTF-8 ; echo 'a\nb\na\n⺌\n⺕\n⺌' | sort )
⺌
⺕
⺌
a
a
b




sort: sort (GNU coreutils) 8.21
Shell: tcsh 6.18.01 (Astron) 2012-02-14 (x86_64-unknown-linux) options
wide,nls,dl,al,kan,rh,color,filec
Fedora Linux 20

Regards, ospalh

Added tag(s) notabug. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Fri, 21 Nov 2014 17:00:03 GMT)

Reply sent to Eric Blake <eblake <at> redhat.com>:
You have taken responsibility. (Fri, 21 Nov 2014 17:00:05 GMT)

Notification sent to Roland Sieker <ospalh <at> gmail.com>:
bug acknowledged by developer. (Fri, 21 Nov 2014 17:00:06 GMT)

Message #12 received at 19142-done <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: Roland Sieker <ospalh <at> gmail.com>, 19142-done <at> debbugs.gnu.org
Subject: Re: bug#19142: sort not working with LANG set to
 language_country.encoding
Date: Fri, 21 Nov 2014 09:59:20 -0700
tag 19142 notabug
thanks

On 11/21/2014 04:24 AM, Roland Sieker wrote:
> Hi.
> 
> I have noticed that sort seems to have problems when the LANG environment
> variable is set with language and country.
> 

Thanks for the report.  The whole point of locales is that each locale
is free to choose the collation sequences that make the most sense for
that locale.


> It sorts OK like this, with LANG just the language.encoding:
> ( setenv LANG en.UTF-8 ; echo 'a\nb\na\n⺌\n⺕\n⺌' | sort )

[I'm translating your csh syntax into more-reliable sh syntax]
Try turning on sort debugging:

$ printf 'a\nb\na\n⺌\n⺕\n⺌' | LC_ALL=en.UTF-8 sort --debug
sort: using simple byte comparison
a
_
a
_
b
_
⺌
___
⺌
___
⺕
___


> But not with LANG as language_country.encoding:

$ printf 'a\nb\na\n⺌\n⺕\n⺌' | LC_ALL=en_GB.UTF-8 sort --debug
sort: using ‘en_GB.UTF-8’ sorting rules
⺌
__
⺕
__
⺌
__
a
_
a
_
b
_


That just means that whoever wrote the en_GB.UTF-8 locale picked a
different collation sequence for non-ASCII characters than the person
that wrote the generic en.UTF-8 locale.  That's not a bug in sort, so
I'm closing this as not a bug from coreutils' perspective.  Feel free to
raise it as a glibc bug (the owner of locale definitions on GNU/Linux
systems) if you have a strong reason why different locales should be
more consistent on their choice of collation sequences.  And feel free
to reply further to this bug with more questions or comments, even
though it has been closed.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


Information forwarded to bug-coreutils <at> gnu.org:
bug#19142; Package coreutils. (Sat, 22 Nov 2014 05:50:04 GMT)

Message #15 received at 19142 <at> debbugs.gnu.org:

From: Bob Proulx <bob <at> proulx.com>
To: Roland Sieker <ospalh <at> gmail.com>
Cc: 19142 <at> debbugs.gnu.org
Subject: Re: bug#19142: sort not working with LANG set to
 language_country.encoding
Date: Fri, 21 Nov 2014 22:49:41 -0700
tag 19142 notabug
close 19142
thanks

Roland Sieker wrote:
> I have noticed that sort seems to have problems when the LANG environment
> variable is set with language and country.

Sort is definitely affected by LANG because LANG sets LC_COLLATE which
controls the collation sequence.  Different locales have different
collating sequences.  I don't like that the English locales such as my
own country's en_US.UTF-8 and others like en_GB.UTF-8 don't sort
"correctly" as far as I am concerned but I can only accept it.  Sort
order is actually a libc function and affects much more than sort.  It
also affects ls and the shell and basically everything on the system
that sorts.
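
A minimal sketch of that precedence, assuming a POSIX shell and any
sort implementation: LC_ALL overrides LC_COLLATE, which in turn
overrides LANG, so byte ordering can be forced for a single command
without changing the rest of the environment.  The variable names
`input` and `sorted` are only illustrative.

```shell
# Sample input from the report, one item per line.
input=$(printf 'a\nb\na\n⺌\n⺕\n⺌')

# LANG asks for en_GB collation, but LC_ALL takes precedence over both
# LC_COLLATE and LANG, so sort compares raw bytes for this one command.
sorted=$(printf '%s\n' "$input" | LANG=en_GB.UTF-8 LC_ALL=C sort)
printf '%s\n' "$sorted"
```

Because the C locale is always available, this works whether or not
en_GB.UTF-8 is installed on the system.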

> It sorts OK like this, with LANG just the language.encoding:
> ( setenv LANG en.UTF-8 ; echo 'a\nb\na\n⺌\n⺕\n⺌' | sort )
> a
> a
> b

Are you sure "en.UTF-8" is a valid locale?  It doesn't look like it to
me.  I think that is an invalid locale and therefore libc is falling
back to the C/POSIX locale.
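
One way to check this, sketched under the assumption that the
`locale` utility is available: `locale -a` lists the locales the
system knows about, and a name missing from that list cannot be
loaded.  The helper name `locale_is_installed` is hypothetical, and
the codeset normalization is an assumption based on glibc listing
"utf8" where users usually write "UTF-8".

```shell
# Hypothetical helper: succeed if the given locale name appears in the
# output of `locale -a`.  Case and the "UTF-8"/"utf8" spelling are
# normalized on both sides before comparing whole lines.
locale_is_installed() {
    want=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/utf-8/utf8/')
    locale -a 2>/dev/null \
        | tr '[:upper:]' '[:lower:]' \
        | sed 's/utf-8/utf8/' \
        | grep -qxF -- "$want"
}

# Report on the two names from this thread.
for name in en.UTF-8 en_GB.UTF-8; do
    if locale_is_installed "$name"; then
        echo "$name: installed"
    else
        echo "$name: not installed"
    fi
done
```

The result for en_GB.UTF-8 depends on which locales the system has
generated, so it will differ from machine to machine.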

> But not with LANG as language_country.encoding:
> ( setenv LANG en_GB.UTF-8 ; echo 'a\nb\na\n⺌\n⺕\n⺌' | sort )

Here "en_GB.UTF-8" is a valid locale and en_GB.UTF-8 uses dictionary
sort ordering.  Dictionary order folds case and ignores punctuation.

Try using the newish sort --debug option.  It will help debug problems
such as this.

  $ printf "a\nb\na\n⺌\n⺕\n⺌\n" | env LC_ALL=en_US.UTF-8 sort --debug
  sort: using ‘en_US.UTF-8’ sorting rules
  ...

  $ printf "a\nb\na\n⺌\n⺕\n⺌\n" | env LC_ALL=en.UTF-8 sort --debug
  sort: using simple byte comparison
  ...

See also the FAQ entry:

  https://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021

Bob




Information forwarded to bug-coreutils <at> gnu.org:
bug#19142; Package coreutils. (Sun, 23 Nov 2014 12:59:01 GMT)

Message #18 received at 19142 <at> debbugs.gnu.org:

From: Bernhard Voelker <mail <at> bernhard-voelker.de>
To: Roland Sieker <ospalh <at> gmail.com>, 19142 <at> debbugs.gnu.org
Subject: Re: bug#19142: sort not working with LANG set to
 language_country.encoding
Date: Sun, 23 Nov 2014 13:58:44 +0100
On 11/21/2014 12:24 PM, Roland Sieker wrote:
> sort: sort (GNU coreutils) 8.21
> Shell: tcsh 6.18.01 (Astron) 2012-02-14 (x86_64-unknown-linux) options
> wide,nls,dl,al,kan,rh,color,filec
> Fedora Linux 20

In addition to what Bob wrote, I want to mention that multi-byte
support is not part of upstream sort; it is added by the distribution,
Fedora in your case.

Have a nice day,
Berny




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 22 Dec 2014 12:24:05 GMT)

This bug report was last modified 10 years and 216 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.