GNU bug report logs - #23012
sort: add option to set specific locale

Package: coreutils

Reported by: John Heidemann <johnh <at> isi.edu>

Date: Mon, 14 Mar 2016 18:24:02 UTC

Severity: wishlist

Tags: wontfix

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 23012 in the body.
You can then email your comments to 23012 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-coreutils <at> gnu.org:
bug#23012; Package coreutils. (Mon, 14 Mar 2016 18:24:02 GMT)

Acknowledgement sent to John Heidemann <johnh <at> isi.edu>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Mon, 14 Mar 2016 18:24:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: John Heidemann <johnh <at> isi.edu>
To: bug-coreutils <at> gnu.org
Subject: add option to specific locale to sort
Date: Mon, 14 Mar 2016 11:02:55 -0700
Locale-specific sorting produces surprising results.
While locale-specific sorting is all as per POSIX, the details are
obscure and can be confusing.

(See for example this comment in the code:
      /* Always output the locale in debug mode, since this
         is such a common source of confusion.  */
and "Sort does not sort in normal order!" at
http://www.gnu.org/software/coreutils/faq/coreutils-faq.html
)

Locale-specific results can only be controlled by setting the LC_ALL
or LC_COLLATE environment variables.  However, this approach results in
"spooky action at a distance"---it is not obvious to users, and it can
be hard to control when sort is used from other programs.


Suggested enhancement: it should be possible to specify the locale on
the command-line, making control of this feature more accessible.


A patch at
http://www.isi.edu/~johnh/SOFTWARE/sort_locale_option_160314.patch
adds --locale=WHATEVER and -L
to accomplish this goal.

The patch is against coreutils-8.24.

Please consider it for submission to coreutils.


A test case that exhibits locale-specific oddness, with current sort:

{ echo '100.0.2'; echo '1.0.2'; echo '1x0.2'; echo 'the 1.0 is first as Kernighan intended'; } | LC_COLLATE=C sort

{ echo '100.0.2'; echo '1.0.2'; echo '1x0.2'; echo 'the 100.0.2 is first, for fun and confusion'; } |LC_COLLATE=en_US.utf8 sort
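
For reference, a minimal sketch of the C-locale half of this test case, reduced to the three numeric-looking lines. It uses LC_ALL rather than LC_COLLATE (an editorial choice, since LC_ALL overrides the other locale variables) and holds the output in a variable named result, which is only an illustrative name. In the C locale, lines compare by byte value, so '.' sorts before '0', which sorts before 'x'.

```shell
#!/bin/sh
# Sort three sample lines in the C locale, where comparison is by byte value.
# '.' (0x2E) sorts before '0' (0x30), which sorts before 'x' (0x78).
result=$(printf '%s\n' '100.0.2' '1.0.2' '1x0.2' | LC_ALL=C sort)
printf '%s\n' "$result"
# prints:
# 1.0.2
# 100.0.2
# 1x0.2
```

The en_US.utf8 ordering is not reproduced here because it depends on that locale being installed on the machine running the command.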


And the happiness that ensues from control without environment variables:

{ echo '100.0.2'; echo '1.0.2'; echo '1x0.2'; echo 'the 1.0 is first as Kernighan intended'; } | ./sort --locale=C

{ echo '100.0.2'; echo '1.0.2'; echo '1x0.2'; echo 'the 100.0.2 is first, for fun and confusion'; } | ./sort --locale=en_US.utf8


If the coreutils maintainers consider this patch acceptable, I will also
write a patch that updates the documentation.

(You may also want to add this option across other tools.  That part is
left as an exercise to the reader. :-)

   -John Heidemann




Information forwarded to bug-coreutils <at> gnu.org:
bug#23012; Package coreutils. (Mon, 14 Mar 2016 18:40:01 GMT)

Message #8 received at 23012 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: John Heidemann <johnh <at> isi.edu>, 23012 <at> debbugs.gnu.org
Subject: Re: bug#23012: add option to specific locale to sort
Date: Mon, 14 Mar 2016 11:39:37 -0700
On 03/14/2016 11:02 AM, John Heidemann wrote:
> A test case that exhibits locale-specific oddness, with current sort:
>
> ... LC_COLLATE=C sort ...
>
> And the happiness that ensues from control without environment variables:
>
> ... sort --locale=C ...
>

I dunno, these approaches seem about the same to me. And 'LC_ALL=C sort' 
is standardized whereas 'sort --locale=C' is not, which is a significant 
advantage for portable scripts. And if we added the --locale option to 
'sort', for consistency we'd need to add it to uniq, awk, grep, etc., 
etc., and document all this, and explain why there are two ways to 
specify the same thing and that one overrides the other, etc., etc. Is 
the minor benefit worth all this hassle?




Information forwarded to bug-coreutils <at> gnu.org:
bug#23012; Package coreutils. (Mon, 14 Mar 2016 20:13:01 GMT)

Message #11 received at 23012 <at> debbugs.gnu.org:

From: John Heidemann <johnh <at> isi.edu>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 23012 <at> debbugs.gnu.org
Subject: Re: bug#23012: add option to specific locale to sort
Date: Mon, 14 Mar 2016 13:12:12 -0700
On Mon, 14 Mar 2016 11:39:37 -0700, Paul Eggert wrote: 
>On 03/14/2016 11:02 AM, John Heidemann wrote:
>> A test case that exhibits locale-specific oddness, with current sort:
>>
>> ... LC_COLLATE=C sort ...
>>
>> And the happiness that ensues from control without environment variables:
>>
>> ... sort --locale=C ...
>>
>
>I dunno, these approaches seem about the same to me. And 'LC_ALL=C
>sort' is standardized whereas 'sort --locale=C' is not, which is a
>significant advantage for portable scripts. And if we added the
>--locale option to 'sort', for consistency we'd need to add it to
>uniq, awk, grep, etc., etc., and document all this, and explain why
>there are two ways to specify the same thing and that one overrides
>the other, etc., etc. Is the minor benefit worth all this hassle?

0. I would suggest that sort has a problem, as shown by the comment in
the code and a large FAQ entry (two of them: one for sort and one for
ls).  People are confused---it would be nice to change something to
reduce confusion.


You're right that there are two questions:

1- is the API with arguments any better?
2- does it need to be uniform across all utilities?


1. About arguments, environment variables vs. CLI approaches are quite different.


In a shell script, you're right that

LC_COLLATE=C sort

vs

sort --locale=C

are about the same.

Except, one might instead put LC_COLLATE elsewhere in the script

export LC_COLLATE=C
# 100 lines of shell
sort
# now with correct behavior depending on a global variable 100 lines ago
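
The scoping difference described here can be sketched as follows. This is an illustration only: the child process is a plain `sh -c` that reports the LC_COLLATE value it inherits, and the variable names prefix, after_prefix and after_export are made up for the example.

```shell
#!/bin/sh
# Start from a known state so the result does not depend on the caller.
unset LC_ALL LC_COLLATE

# A prefix assignment is visible only to the one command it precedes.
prefix=$(LC_COLLATE=C sh -c 'echo "${LC_COLLATE:-unset}"')

# The next child process does not see it.
after_prefix=$(sh -c 'echo "${LC_COLLATE:-unset}"')

# An exported variable is inherited by every later child process.
export LC_COLLATE=C
after_export=$(sh -c 'echo "${LC_COLLATE:-unset}"')

echo "$prefix $after_prefix $after_export"
# prints: C unset C
```

The prefix form keeps the setting next to the command it affects. The export form is what produces the action at a distance, because every later command inherits it.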


Things look even more different from C, where it is setenv("LC_COLLATE",
"C", 1); vs. arguments to execve.


Is CLI *better*?   I suggest slightly better, but not that much better
(by itself).

Where it wins (I think) is that it is more regular with how other things
are done with sort.  It would appear in the man page.

(By the way: if you don't take the patch, (a) I encourage you to copy the
text about LC_COLLATE from the info page to the manual page, and (b) it
looks like (in the code) the monetary aspects of locale also affect
sorting.   That is not mentioned in either info or man.  These changes
might address some of the confusion raised in #0 above.)


2. Does it have to be across all utilities?

Maybe in the fullness of time.  Or maybe not.

For me, sort is particularly important because some apps depend on
its output.  In other tools (like ls, from the FAQ), output order
doesn't usually affect correctness.

The specific use case that led me here is that Hadoop wants sorted
input to the reduce phase.  By default it uses a Java-based sort with
sort(1)-style arguments.  However, it ignores locale.  To be compatible,
one must run GNU sort with LC_COLLATE=C, and figuring that out is not at
all obvious.
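
A minimal sketch of that kind of invocation, assuming tab-separated key/value records and a consumer that expects keys in plain byte order (as this message reports for Hadoop's default sort). The sample records and the variable names tab and sorted are invented for illustration; -t and -k are standard sort options.

```shell
#!/bin/sh
# Sort tab-separated key/value records by key, in byte order.
# LC_ALL=C overrides LC_COLLATE and LANG, so the result does not
# depend on the caller's environment.
tab=$(printf '\t')
sorted=$(printf '%s\n' "b${tab}1" "B${tab}2" "a${tab}3" |
  LC_ALL=C sort -t "$tab" -k1,1)
printf '%s\n' "$sorted"
# Keys come out as B, a, b: in byte order every uppercase ASCII
# letter precedes every lowercase one.
```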


   -John Heidemann





Information forwarded to bug-coreutils <at> gnu.org:
bug#23012; Package coreutils. (Tue, 15 Mar 2016 07:18:01 GMT)

Message #14 received at 23012 <at> debbugs.gnu.org:

From: Bernhard Voelker <mail <at> bernhard-voelker.de>
To: Paul Eggert <eggert <at> cs.ucla.edu>, John Heidemann <johnh <at> isi.edu>,
 23012 <at> debbugs.gnu.org
Subject: Re: bug#23012: add option to specific locale to sort
Date: Tue, 15 Mar 2016 08:17:07 +0100
On 03/14/2016 07:39 PM, Paul Eggert wrote:
>  Is the minor benefit worth all this hassle?

I'm afraid not.  Furthermore, I've seen several "creative" approaches
of day-to-day users for spelling the locale: is it upper-case? Use a dot
or a minus or an underscore between the 1st and 2nd part?  Is the country
"de" or why would it be "de_de" or "eu_de" or ...? Is it "utf-8" or
"UTF.8" or "utf8"?  TBC.

After all, the only case I see is that users are unhappy with the sorting
of their non-C locale, or put the other way round: probably the only value
needed is --locale=C which could then be a fixed option like --C-locale.
Yet, as there's "LC_ALL=C sort ...", this is still redundant.
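
In a script, the effect of such a fixed option can be approximated today with a small wrapper. The sketch below is only an illustration: csort is a hypothetical function name, not something coreutils provides.

```shell
#!/bin/sh
# csort is a hypothetical helper name, not part of coreutils.
# It always runs sort in the C locale and passes all arguments through.
csort() {
  LC_ALL=C sort "$@"
}

out=$(printf '%s\n' 'b' 'B' 'a' | csort)
printf '%s\n' "$out"
# prints:
# B
# a
# b
```

Because the assignment prefixes the external sort command inside the function, it does not change the locale of the calling shell.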

Have a nice day,
Berny




Changed bug title to 'sort: add option to set specific locale' from 'add option to specific locale to sort'. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Thu, 25 Oct 2018 16:08:02 GMT)

Severity set to 'wishlist' from 'normal'. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Thu, 25 Oct 2018 16:08:02 GMT)

Added tag(s) wontfix. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Thu, 25 Oct 2018 16:08:02 GMT)

Bug closed; send any further explanations to 23012 <at> debbugs.gnu.org and John Heidemann <johnh <at> isi.edu>. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Thu, 25 Oct 2018 16:08:02 GMT)

Bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 23 Nov 2018 12:24:04 GMT)

This bug report was last modified 6 years and 212 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.