GNU bug report logs -
#23012
sort: add option to set specific locale
Reported by: John Heidemann <johnh <at> isi.edu>
Date: Mon, 14 Mar 2016 18:24:02 UTC
Severity: wishlist
Tags: wontfix
Done: Assaf Gordon <assafgordon <at> gmail.com>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 23012 in the body.
You can then email your comments to 23012 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-coreutils <at> gnu.org: bug#23012; Package coreutils. (Mon, 14 Mar 2016 18:24:02 GMT)
Acknowledgement sent to John Heidemann <johnh <at> isi.edu>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Mon, 14 Mar 2016 18:24:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Locale-specific sorting produces surprising results.
While locale-specific sorting is all as per POSIX, the details are
obscure and can be confusing.
(See for example this comment in the code:
/* Always output the locale in debug mode, since this
is such a common source of confusion. */
and "Sort does not sort in normal order!" at
http://www.gnu.org/software/coreutils/faq/coreutils-faq.html
)
Locale-specific results can only be controlled by setting the LC_ALL
or LC_COLLATE environment variables. However, this approach results in
"spooky action at a distance"---it is not obvious to users, and it can
be hard to control when sort is used from other programs.
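One way to pin collation per invocation with today's sort, so that it does not depend on the ambient environment, is a small wrapper. This is only a minimal sketch and not part of the patch; the helper name sort_c is hypothetical, and it assumes a POSIX shell with sort on the PATH.

```shell
# Hypothetical helper (not part of coreutils): run sort with every locale
# category forced to C, whatever the caller's environment says.
sort_c() {
    LC_ALL=C sort "$@"
}

# Byte-order comparison: '.' (0x2E) sorts before '0' (0x30), which sorts
# before 'x' (0x78), so this prints 1.0.2, then 100.0.2, then 1x0.2.
printf '%s\n' '100.0.2' '1.0.2' '1x0.2' | sort_c
```

LC_ALL is used rather than LC_COLLATE because LC_ALL takes precedence over the individual LC_* variables, so an LC_ALL value set elsewhere cannot override the wrapper.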
Suggested enhancement: it should be possible to specify the locale on
the command-line, making control of this feature more accessible.
A patch at
http://www.isi.edu/~johnh/SOFTWARE/sort_locale_option_160314.patch
adds --locale=WHATEVER and -L
to accomplish this goal.
The patch is against coreutils-8.24.
Please consider it for submission to coreutils.
A test case that exhibits locale-specific oddness, with current sort:
{ echo '100.0.2'; echo '1.0.2'; echo '1x0.2'; echo 'the 1.0 is first as Kernighan intended'; } | LC_COLLATE=C sort
{ echo '100.0.2'; echo '1.0.2'; echo '1x0.2'; echo 'the 100.0.2 is first, for fun and confusion'; } | LC_COLLATE=en_US.utf8 sort
And the happiness that ensues from control without environment variables:
{ echo '100.0.2'; echo '1.0.2'; echo '1x0.2'; echo 'the 1.0 is first as Kernighan intended'; } | ./sort --locale=C
{ echo '100.0.2'; echo '1.0.2'; echo '1x0.2'; echo 'the 100.0.2 is first, for fun and confusion'; } | ./sort --locale=en_US.utf8
If the coreutils maintainers consider this patch acceptable, I will also
write a patch that updates the documentation.
(You may also want to add this option across other tools. That part is
left as an exercise to the reader. :-)
-John Heidemann
Information forwarded to bug-coreutils <at> gnu.org: bug#23012; Package coreutils. (Mon, 14 Mar 2016 18:40:01 GMT)
Message #8 received at 23012 <at> debbugs.gnu.org (full text, mbox):
On 03/14/2016 11:02 AM, John Heidemann wrote:
> A test case that exhibits locale-specific oddness, with current sort:
>
> ... LC_COLLATE=C sort ...
>
> And the happiness that ensues from control without environment variables:
>
> ... sort --locale=C ...
>
I dunno, these approaches seem about the same to me. And 'LC_ALL=C sort'
is standardized whereas 'sort --locale=C' is not, which is a significant
advantage for portable scripts. And if we added the --locale option to
'sort', for consistency we'd need to add it to uniq, awk, grep, etc.,
etc., and document all this, and explain why there are two ways to
specify the same thing and that one overrides the other, etc., etc. Is
the minor benefit worth all this hassle?
Information forwarded to bug-coreutils <at> gnu.org: bug#23012; Package coreutils. (Mon, 14 Mar 2016 20:13:01 GMT)
Message #11 received at 23012 <at> debbugs.gnu.org (full text, mbox):
On Mon, 14 Mar 2016 11:39:37 -0700, Paul Eggert wrote:
>On 03/14/2016 11:02 AM, John Heidemann wrote:
>> A test case that exhibits locale-specific oddness, with current sort:
>>
>> ... LC_COLLATE=C sort ...
>>
>> And the happiness that ensues from control without environment variables:
>>
>> ... sort --locale=C ...
>>
>
>I dunno, these approaches seem about the same to me. And 'LC_ALL=C
>sort' is standardized whereas 'sort --locale=C' is not, which is a
>significant advantage for portable scripts. And if we added the
>--locale option to 'sort', for consistency we'd need to add it to
>uniq, awk, grep, etc., etc., and document all this, and explain why
>there are two ways to specify the same thing and that one overrides
>the other, etc., etc. Is the minor benefit worth all this hassle?
0. I would suggest that sort has a problem, as shown by the comment in
the code and a large FAQ entry (two of them: one for sort and one for
ls). People are confused---it would be nice to change something to
reduce confusion.
You're right that there are two questions:
1- is the API with arguments any better?
2- does it need to be uniform across all utilities?
1. About arguments, environment variables vs. CLI approaches are quite different.
In a shell script, you're right that
LC_COLLATE=C sort
vs
sort --locale=C
are about the same.
Except, one might instead put LC_COLLATE elsewhere in the script
export LC_COLLATE=C
# 100 lines of shell
sort
# now with correct behavior depending on a global variable 100 lines ago
Things look even more different from C, where it is setenv("LC_COLLATE",
"C", 1); vs. arguments to execve.
Is CLI *better*? I suggest slightly better, but not that much better
(by itself).
Where it wins (I think) is that it is more regular with how other things
are done with sort. It would appear in the man page.
(By the way: if you don't take the patch, (a) I encourage you to copy the
text about LC_COLLATE from the info page to the manual page, and (b) it
looks like (in the code) the monetary aspects of locale also affect
sorting. That is not mentioned in either info or man. These changes
might address some of the confusion raised in #0 above.)
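The scoping difference described in point 1 can be sketched as follows. This is an illustrative sketch assuming a POSIX shell; the variable names scoped, after_scoped, and after_export are made up for the demo, and a child sh stands in for sort so that what the child process sees can be printed.

```shell
# Start from a known state so the demo does not depend on the caller.
unset LC_ALL LC_COLLATE

# Per-command assignment: only this one child process sees the value.
scoped=$(LC_COLLATE=C sh -c 'echo "${LC_COLLATE-unset}"')

# The next child is unaffected; the assignment did not leak.
after_scoped=$(sh -c 'echo "${LC_COLLATE-unset}"')

# Exported assignment: every later child inherits it, however far away.
export LC_COLLATE=C
after_export=$(sh -c 'echo "${LC_COLLATE-unset}"')

echo "per-command: $scoped; next command: $after_scoped; after export: $after_export"
```

The per-command form keeps the setting next to the sort call it affects, which is close to what a command-line option would give. The exported form is the "100 lines ago" case.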
2. does it have to be across all utilities?
Maybe in the fullness of time. Or maybe not.
For me, sort is particularly important because some apps depend on
its output. In other tools (like ls, from the FAQ), output order
doesn't usually affect correctness.
The specific use case that led me here is that Hadoop wants sorted
input to the reduce phase. By default it uses a Java-based sort with
sort(1)-style arguments. However, it ignores locale. To be compatible,
one must run GNU sort with LC_COLLATE=C, and figuring that out is not at
all obvious.
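For the byte-order requirement described above, a consumer-side check can be sketched like this. It is a rough sketch under assumptions: the helper name is_byte_sorted and the sample data are hypothetical and have nothing to do with Hadoop's own tooling. It uses only the standard sort -c option, which checks order without producing output.

```shell
# Hypothetical helper: succeed only if the input (files, or stdin when no
# file is given) is already in C-locale (byte) order.
is_byte_sorted() {
    LC_ALL=C sort -c "$@" 2>/dev/null
}

# Sample data that is NOT in byte order ('1.0.2' should come first).
if printf '%s\n' '100.0.2' '1.0.2' '1x0.2' | is_byte_sorted; then
    status=sorted
else
    status=unsorted
fi
echo "raw input: $status"

# After sorting under LC_ALL=C the same check passes.
if printf '%s\n' '100.0.2' '1.0.2' '1x0.2' | LC_ALL=C sort | is_byte_sorted; then
    fixed=sorted
else
    fixed=unsorted
fi
echo "after LC_ALL=C sort: $fixed"
```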
-John Heidemann
Information forwarded to bug-coreutils <at> gnu.org: bug#23012; Package coreutils. (Tue, 15 Mar 2016 07:18:01 GMT)
Message #14 received at 23012 <at> debbugs.gnu.org (full text, mbox):
On 03/14/2016 07:39 PM, Paul Eggert wrote:
> Is the minor benefit worth all this hassle?
I'm afraid not. Furthermore, I've seen several "creative" approaches
of day-to-day users for spelling the locale: is it upper-case? Use a dot
or a minus or an underscore between the 1st and 2nd part? Is the country
"de" or why would it be "de_de" or "eu_de" or ...? Is it "utf-8" or
"UTF.8" or "utf8"? TBC.
After all, the only case I see is that users are unhappy with the sorting
of their non-C locale, or put the other way round: probably the only value
needed is --locale=C which could then be a fixed option like --C-locale.
Yet, as there's "LC_ALL=C sort ...", this is still redundant.
Have a nice day,
Berny
Changed bug title to 'sort: add option to set specific locale' from 'add option to specific locale to sort'
Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Thu, 25 Oct 2018 16:08:02 GMT)
Severity set to 'wishlist' from 'normal'
Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Thu, 25 Oct 2018 16:08:02 GMT)
Added tag(s) wontfix.
Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Thu, 25 Oct 2018 16:08:02 GMT)
bug closed, send any further explanations to 23012 <at> debbugs.gnu.org and John Heidemann <johnh <at> isi.edu>
Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Thu, 25 Oct 2018 16:08:02 GMT)
bug archived.
Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 23 Nov 2018 12:24:04 GMT)
This bug report was last modified 6 years and 212 days ago.