GNU bug report logs - #17188
Sort bugs

Package: coreutils;

Reported by: Nikos Balkanas <nbalkanas <at> gmail.com>

Date: Sat, 5 Apr 2014 02:27:02 UTC

Severity: normal

Tags: notabug

Merged with 17189

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 17188 in the body.
You can then email your comments to 17188 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-coreutils <at> gnu.org:
bug#17188; Package coreutils. (Sat, 05 Apr 2014 02:27:02 GMT)

Acknowledgement sent to Nikos Balkanas <nbalkanas <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Sat, 05 Apr 2014 02:27:03 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Nikos Balkanas <nbalkanas <at> gmail.com>
To: bug-coreutils <at> gnu.org
Subject: Sort bugs
Date: Sat, 5 Apr 2014 05:07:02 +0300
Hi,

Sort is seriously bugged. This is the output from:

sort -d -t \t -k1 input > out

0009rN2S3cohd2DGH6yuTWBoeuq6DwWZhCBDEnFzYqpw984FfALy7NUhEZH1.YEbiq/
000EMQeKUjtyXIOaUkT.XE6SaBIdOqTA0nffF394V6tkcVdup2c3ihi7yhbuRof2Y5agTG
000p8kXIz5Tc1GaxYYXjAfgm7YJOZvyBJxVXMi0lhaJXT22IdDbE6vVhWXW9FkRBxQ
00/0QwzaXrqGHXW7mE9Le8IIVgHoZvccgGydKdzJgh8.SZenbULmIWMtrGShz24W7T
000R2cnZ8.khe1eXDERclkbXASRQeKvcNBaCJRLX617Xvmff0KaoZSSFBNhNG1OiIyr

Shouldn't 00/0 be first according to Ascii code?

Plz fix.

TIA,
Nikos

Added tag(s) notabug. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Sat, 05 Apr 2014 12:22:02 GMT)

Reply sent to Eric Blake <eblake <at> redhat.com>:
You have taken responsibility. (Sat, 05 Apr 2014 12:22:03 GMT)

Notification sent to Nikos Balkanas <nbalkanas <at> gmail.com>:
bug acknowledged by developer. (Sat, 05 Apr 2014 12:22:04 GMT)

Message #12 received at 17188-done <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: Nikos Balkanas <nbalkanas <at> gmail.com>, 17188-done <at> debbugs.gnu.org
Subject: Re: bug#17188: Sort bugs
Date: Sat, 05 Apr 2014 06:21:33 -0600
tag 17188 notabug
thanks

On 04/04/2014 08:07 PM, Nikos Balkanas wrote:
> Hi,
> 
> Sort is seriously bugged. This is the output from:
> 
> sort -d -t \t -k1 input > out

-d says to do a dictionary sort that ignores non-alphanumeric
characters.  But it still leaves it up to your current locale whether
the remaining alphanumeric characters are collated case-insensitively.

Also, '-k1' is almost always wrong - you generally want '-k1,1' if you
want to sort by JUST the first field, rather than by the whole line.
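
To make the difference concrete, here is a small sketch on three made-up
tab-separated lines (the data and the TAB variable are illustrative, not
taken from the report). With -s disabling the last-resort whole-line
comparison, '-k1,1' leaves lines with equal first fields in input order,
while '-k1' compares through to the end of the line:

```shell
# Hypothetical sample data: first field, a TAB, then a second field.
TAB=$(printf '\t')
sample() { printf 'b\t2\na\t9\na\t1\n'; }

# Key runs from field 1 to end of line, so "a<TAB>1" sorts before "a<TAB>9".
whole_line=$(sample | LC_ALL=C sort -s -t "$TAB" -k1)

# Key is field 1 only; -s keeps the two "a" lines in their input order.
first_field=$(sample | LC_ALL=C sort -s -t "$TAB" -k1,1)

printf '%s\n---\n%s\n' "$whole_line" "$first_field"
```

LC_ALL=C is set here only so that the result does not depend on the
locale of the machine running the sketch.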

See the FAQ:
https://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021

> 
> 0009rN2S3cohd2DGH6yuTWBoeuq6DwWZhCBDEnFzYqpw984FfALy7NUhEZH1.YEbiq/
> 000EMQeKUjtyXIOaUkT.XE6SaBIdOqTA0nffF394V6tkcVdup2c3ihi7yhbuRof2Y5agTG
> 000p8kXIz5Tc1GaxYYXjAfgm7YJOZvyBJxVXMi0lhaJXT22IdDbE6vVhWXW9FkRBxQ
> 00/0QwzaXrqGHXW7mE9Le8IIVgHoZvccgGydKdzJgh8.SZenbULmIWMtrGShz24W7T
> 000R2cnZ8.khe1eXDERclkbXASRQeKvcNBaCJRLX617Xvmff0KaoZSSFBNhNG1OiIyr
> 
> Shouldn't 00/0 be first according to Ascii code?

Only if you are asking for a full ASCII sort.  In the examples below I
add -s so that --debug prints fewer lines.  Using --debug can help show
where you are asking sort to do something different from what you
expected, even though sort is behaving correctly given what you asked it
to do.

I'm guessing your default locale is en_US.UTF-8 - because I get the same
results as you in that mode:

$ sort --debug -s -d -t \t -k1 input
sort: using ‘en_US.UTF-8’ sorting rules
0009rN2S3cohd2DGH6yuTWBoeuq6DwWZhCBDEnFzYqpw984FfALy7NUhEZH1.YEbiq/
___________________________________________________________________
000EMQeKUjtyXIOaUkT.XE6SaBIdOqTA0nffF394V6tkcVdup2c3ihi7yhbuRof2Y5agTG
______________________________________________________________________
000p8kXIz5Tc1GaxYYXjAfgm7YJOZvyBJxVXMi0lhaJXT22IdDbE6vVhWXW9FkRBxQ
__________________________________________________________________
00/0QwzaXrqGHXW7mE9Le8IIVgHoZvccgGydKdzJgh8.SZenbULmIWMtrGShz24W7T
__________________________________________________________________
000R2cnZ8.khe1eXDERclkbXASRQeKvcNBaCJRLX617Xvmff0KaoZSSFBNhNG1OiIyr
___________________________________________________________________

In this mode, '000p' collates case-insensitively before '000Q', so the
sort is correct (the collation was on '000Q' and not '00/0Q' because you
used -d).  Furthermore, if you omit -d:

$ sort --debug -s -t \t -k1 input
sort: using ‘en_US.UTF-8’ sorting rules
0009rN2S3cohd2DGH6yuTWBoeuq6DwWZhCBDEnFzYqpw984FfALy7NUhEZH1.YEbiq/
___________________________________________________________________
000EMQeKUjtyXIOaUkT.XE6SaBIdOqTA0nffF394V6tkcVdup2c3ihi7yhbuRof2Y5agTG
______________________________________________________________________
000p8kXIz5Tc1GaxYYXjAfgm7YJOZvyBJxVXMi0lhaJXT22IdDbE6vVhWXW9FkRBxQ
__________________________________________________________________
00/0QwzaXrqGHXW7mE9Le8IIVgHoZvccgGydKdzJgh8.SZenbULmIWMtrGShz24W7T
__________________________________________________________________
000R2cnZ8.khe1eXDERclkbXASRQeKvcNBaCJRLX617Xvmff0KaoZSSFBNhNG1OiIyr
___________________________________________________________________

No change, because the en_US.UTF-8 locale implicitly does a dictionary
collation even without you requesting -d.

Now, compare to the C locale, which forces sorting by byte value for
more traditional ASCII sorting:


$ LC_ALL=C sort --debug -s -d -t \t -k1 input
sort: using simple byte comparison
0009rN2S3cohd2DGH6yuTWBoeuq6DwWZhCBDEnFzYqpw984FfALy7NUhEZH1.YEbiq/
___________________________________________________________________
000EMQeKUjtyXIOaUkT.XE6SaBIdOqTA0nffF394V6tkcVdup2c3ihi7yhbuRof2Y5agTG
______________________________________________________________________
00/0QwzaXrqGHXW7mE9Le8IIVgHoZvccgGydKdzJgh8.SZenbULmIWMtrGShz24W7T
__________________________________________________________________
000R2cnZ8.khe1eXDERclkbXASRQeKvcNBaCJRLX617Xvmff0KaoZSSFBNhNG1OiIyr
___________________________________________________________________
000p8kXIz5Tc1GaxYYXjAfgm7YJOZvyBJxVXMi0lhaJXT22IdDbE6vVhWXW9FkRBxQ
__________________________________________________________________


'000Q' now sorts before '000R' which sorts before '000p' as expected.
And toss out the -d, and you get:

$ LC_ALL=C sort --debug -s -t \t -k1 input
sort: using simple byte comparison
00/0QwzaXrqGHXW7mE9Le8IIVgHoZvccgGydKdzJgh8.SZenbULmIWMtrGShz24W7T
__________________________________________________________________
0009rN2S3cohd2DGH6yuTWBoeuq6DwWZhCBDEnFzYqpw984FfALy7NUhEZH1.YEbiq/
___________________________________________________________________
000EMQeKUjtyXIOaUkT.XE6SaBIdOqTA0nffF394V6tkcVdup2c3ihi7yhbuRof2Y5agTG
______________________________________________________________________
000R2cnZ8.khe1eXDERclkbXASRQeKvcNBaCJRLX617Xvmff0KaoZSSFBNhNG1OiIyr
___________________________________________________________________
000p8kXIz5Tc1GaxYYXjAfgm7YJOZvyBJxVXMi0lhaJXT22IdDbE6vVhWXW9FkRBxQ
__________________________________________________________________


Now '00/' sorts before '000'.
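
The two C-locale runs can be reproduced without the original input file.
This is a minimal sketch that uses shortened stand-ins for the five
hashes (the truncated keys and the helper name keys are assumptions made
for illustration):

```shell
# Shortened stand-ins for the five hashes in the report.
keys() { printf '0009\n000E\n000p\n00/0Q\n000R\n'; }

# Plain byte order: '/' (0x2F) sorts before '0' (0x30).
byte_order=$(keys | LC_ALL=C sort)

# -d skips '/', so 00/0Q is compared as 000Q.
dict_order=$(keys | LC_ALL=C sort -d)

printf '%s\n---\n%s\n' "$byte_order" "$dict_order"
```

The first listing starts with 00/0Q; the second places it between 000E
and 000R, with 000p last in both because lowercase letters have higher
byte values than uppercase ones.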

It might be a nice improvement to the --debug output to avoid putting _
under any character that sort ignored due to -d before calling strcoll()
(which would help the output of the LC_ALL=C case, but not the
en_US.UTF-8 case) - but that's probably difficult to implement.

> 
> Plz fix.

There's nothing to fix but your usage pattern.  So I'm closing this as
not a bug.  But feel free to reply further if you still have questions.


-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Forcibly Merged 17188 17189. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Sat, 05 Apr 2014 12:24:02 GMT)

Information forwarded to bug-coreutils <at> gnu.org:
bug#17188; Package coreutils. (Sat, 05 Apr 2014 18:43:02 GMT)

Message #17 received at 17188-done <at> debbugs.gnu.org:

From: Nikos Balkanas <nbalkanas <at> gmail.com>
To: Eric Blake <eblake <at> redhat.com>
Cc: 17188-done <at> debbugs.gnu.org
Subject: Re: bug#17188: Sort bugs
Date: Sat, 5 Apr 2014 21:42:31 +0300
On Sat, Apr 5, 2014 at 3:21 PM, Eric Blake <eblake <at> redhat.com> wrote:

> tag 17188 notabug
> thanks
>
> On 04/04/2014 08:07 PM, Nikos Balkanas wrote:
> > Hi,
> >
> > Sort is seriously bugged. This is the output from:
> >
> > sort -d -t \t -k1 input > out
>
> -d says to do a dictionary sort that ignores non-alphanumeric
> characters.  But it still leaves it up to your current locale whether
> the remaining alphanumeric characters are collated case-insensitively.
>
> Also, '-k1' is almost always wrong - you generally want '-k1,1' if you
> want to sort by JUST the first field, rather than by the whole line.
>

Sorting by the first line? What is that? Sort should work on each line by
given columns.

Unix man:

KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F
is a field number and C a character position in the field; both are origin
1, and the stop position defaults to the line's end...

In retrospect this confirms what you said. At first glance, however, it
doesn't make sense. An example like the one you gave me, in the man page,
would save a lot of explaining.


> See the FAQ:
>
> https://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021
>
>
From that link:
"So far there is still no fully satisfactory solution to this problem. If
you find one then please contact me so that this information can be listed."

If you are "me", then I would like to suggest that you make default the
legacy sort behaviour, and add with -c the locale support
that standards and non-English users ask for.

>
> > 0009rN2S3cohd2DGH6yuTWBoeuq6DwWZhCBDEnFzYqpw984FfALy7NUhEZH1.YEbiq/
> > 000EMQeKUjtyXIOaUkT.XE6SaBIdOqTA0nffF394V6tkcVdup2c3ihi7yhbuRof2Y5agTG
> > 000p8kXIz5Tc1GaxYYXjAfgm7YJOZvyBJxVXMi0lhaJXT22IdDbE6vVhWXW9FkRBxQ
> > 00/0QwzaXrqGHXW7mE9Le8IIVgHoZvccgGydKdzJgh8.SZenbULmIWMtrGShz24W7T
> > 000R2cnZ8.khe1eXDERclkbXASRQeKvcNBaCJRLX617Xvmff0KaoZSSFBNhNG1OiIyr
> >
> > Shouldn't 00/0 be first according to Ascii code?
>

I have to sort billions of these hashes, in files that run to terabytes.
Each hash is followed by a <TAB> and a key (hence the -t \t).
I was considering downloading the legacy sort sources and compiling them
for my system, or taking recent sources and patching them.
Both are dreadful prospects, because either would make my system
incompatible and inconsistent. You don't know how happy you make me
by showing that I can still get legacy behaviour out of the modern releases.
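
One detail of that command line is worth checking independently of the
locale: the shell strips the backslash from an unquoted \t, so sort
receives the letter 't' as its separator rather than a TAB. A minimal
sketch of passing a real TAB portably (the variable name TAB and the
sample data are illustrative assumptions):

```shell
# What the shell actually hands to sort for an unquoted \t:
unquoted=$(printf '%s' \t)      # the single letter t

# A literal TAB character, built portably with printf.
TAB=$(printf '\t')

# Hypothetical data: hash, TAB, key. Sort on the hash field only.
sorted=$(printf 'bt2\tx\nat9\ty\n' | LC_ALL=C sort -t "$TAB" -k1,1)
printf '%s\n' "$sorted"
```

In bash, -t $'\t' is an equivalent spelling. With '-k1' the key runs to
the end of the line whatever the separator is, which is why the --debug
underlines in this thread look the same either way.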

> There's nothing to fix but your usage pattern.  So I'm closing this as
> not a bug.  But feel free to reply further if you still have questions.
>

UI is still a bug, though not a code bug. And legacy UI compatibility is
broken. However, I am perfectly satisfied with your fast and long
explanation of what the status is.
You will, however, go crazy if you respond like that to every user with a
locale sorting issue.  Can't you make default LOCALE=C for sorting and
allow users to change that to the system settings using -c when they need
it? Nowadays users use other graphical tools to do sorting, sort is used
mostly by scripts.

Thank you,
Nikos

>
>
> --
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
>
>

Information forwarded to bug-coreutils <at> gnu.org:
bug#17188; Package coreutils. (Sat, 05 Apr 2014 20:24:02 GMT)

Message #20 received at 17188 <at> debbugs.gnu.org:

From: Bob Proulx <bob <at> proulx.com>
To: Nikos Balkanas <nbalkanas <at> gmail.com>
Cc: 17188 <at> debbugs.gnu.org
Subject: Re: bug#17188: Sort bugs
Date: Sat, 5 Apr 2014 14:23:29 -0600
Nikos Balkanas wrote:
> Eric Blake wrote:
> > See the FAQ:
> >
> > https://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021
> >
> From that link:
> "So far there is still no fully satisfactory solution to this problem. If
> you find one then please contact me so that this information can be listed."
> 
> If you are "me", then I would like to suggest that you make default
> the legacy sort behaviour, and add with -c the locale support that
> standards and non-English users ask for.

When I wrote that I did mean within the confines of continuing to
conform to the standards.  :-)

> UI is still a bug, though not a code bug. And legacy UI compatibility is
> broken.

Actually no.  If you were really using the legacy UI then you would be
using the legacy locale setting LC_ALL=C too.  If you aren't then you
aren't using the legacy UI.

> However, I am perfectly satisfied with your fast and long
> explanation of what the status is.
> You will, however, go crazy if you respond like that to every user with a
> locale sorting issue.

I usually rant:

You don't like it and I don't like it but the-powers-that-be have
confused working with data on a computer with talking about working
with data on a computer.  They have decided that the collation
ordering (sort ordering) for data should be dictionary ordering.  In
dictionary ordering case is folded together and punctuation is
ignored.  By having LANG set to any of the "en" locales the system is
instructed to use dictionary sort ordering.  This affects almost
everything on the system that sorts.  This includes commands such as
'ls' and also your shell (e.g. 'echo *') too.

> Can't you make default LOCALE=C for sorting and allow users to
> change that to the system settings using -c when they need it?

Actually no we can't.  That would break the opposite side of things
where people rely upon dictionary sorting based upon their chosen
locale setting.  After all of these years that would be equally bad in
the opposite way.

I am going to say "you" here but please don't take this as hostility.
It is a bad word in text email.  But I am really just trying to put
down the facts of the case.

Originally the locale was C.  If you go back to the C locale things
will be working for you as you wish it to work.  It will work as it
worked before.  Agreed?

Then you changed something.  You changed the locale.  You in your
environment set LANG=en_US.UTF-8 (or similar equivalent).  That is
when you notice that sort doesn't work as you want it to work.

Now you might say that you personally didn't make that choice but your
system vendor did.  It happened when you switched to a new machine
running a newer system or something.  Okay.  But you chose that system
vendor.  You could choose a different system vendor.  Or choose to go
back to the previous system with the previous LANG=C locale.  Or
choose to configure the new system as you wish it.  You are in control
of it.

We pilots have a saying, "Fly the airplane.  Don't let the
airplane fly you." :-)

You could file bugs with your system vendor that they defaulted you to
LANG=en_US.UTF-8 and ask them to allow users to choose LANG=C at
install time instead.  I have done this and unfortunately the response
from one vendor was "That was intentional." with the bug closed and
locked against further comment.  The door slammed in my face.  I am
now using a different software distribution.

> Nowadays users use other graphical tools to do sorting, sort is used
> mostly by scripts.

For you perhaps.  Not for me.  Not for many people.  I have no idea
what the survey count would be either way but it doesn't matter.
We can't make the mistake of assuming that any one environment is more
important to the exclusion of all others.

But you see the problem isn't a change in sort.  The problem is a
change in locale.  Sort is behaving as it has for years and years.
What changed was the locale that most people get by default.  It used
to be that users would get LANG=C.  But these days most users get
LANG=en_US.UTF-8.  With a locale that collates in dictionary order,
sort behaves undesirably for many of us.  But to others that is exactly
what they want.  And so they wrote it into the locale.  These are two
opposing viewpoints that cannot be reconciled.

Note that this is bigger than just sort.  This affects everything on
your system.  It affects the shell.  Try "echo *" and look at the sort
ordering.  Same thing there.  The shell will sort by locale sort order.
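
The effect on file listings is easy to see in a scratch directory. This
is a minimal sketch with three made-up file names; it shows only the C
locale side, since the other side depends on which locales are
installed:

```shell
# Create three empty files in a scratch directory (names are illustrative).
dir=$(mktemp -d)
touch "$dir/a" "$dir/B" "$dir/c"

# Under the C locale, names sort by byte value: uppercase before lowercase.
c_listing=$(cd "$dir" && LC_ALL=C ls)
printf '%s\n' "$c_listing"

rm -r "$dir"
```

Running the same ls without LC_ALL=C under an "en" locale typically lists
a before B.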

The only way to fix it is to fix it at the source of the problem.  The
source is the locale collation sequence.  Which is why I always set
this in my environment.

  export LANG=en_US.UTF-8
  export LC_COLLATE=C

But while that works for most western locales I have no idea how that
would interact with Chinese Big5 for example.  Probably badly.  So it
can't really be offered as a general solution to the problem.  But if
you are using one of the set of western locales that it works for then
it does solve the problem for you.
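
If changing the collation for the whole session is a concern, the
override can also be scoped to a single command. A minimal sketch,
assuming a wrapper named csort (a hypothetical helper, not part of
coreutils):

```shell
# Hypothetical helper: run sort with byte-order collation for this call only.
# LC_ALL is used because it takes precedence over LC_COLLATE and LANG.
csort() { LC_ALL=C sort "$@"; }

result=$(printf 'b\nB\na\nA\n' | csort)
printf '%s\n' "$result"
```

Nothing outside the wrapped sort call sees the changed locale, so other
programs keep the system settings.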

I keep thinking that one of these days I should dig into it and create
my own locale.  Something like LANG=en_US.C.UTF-8 that would define a
sane sort ordering that wouldn't require LC_COLLATE=C to fix.  But
there isn't much itch to scratch there since LC_COLLATE=C does
effectively the same thing to fix the problem.  For western locales
anyway and we don't usually hear from anyone else with this problem.

Bob




Information forwarded to bug-coreutils <at> gnu.org:
bug#17188; Package coreutils. (Sat, 05 Apr 2014 20:45:02 GMT)

Message #23 received at 17188 <at> debbugs.gnu.org:

From: Nikos Balkanas <nbalkanas <at> gmail.com>
To: Bob Proulx <bob <at> proulx.com>
Cc: 17188 <at> debbugs.gnu.org
Subject: Re: bug#17188: Sort bugs
Date: Sat, 5 Apr 2014 23:44:52 +0300
On Sat, Apr 5, 2014 at 11:23 PM, Bob Proulx <bob <at> proulx.com> wrote:

[...snip...]


> Originally the locale was C.  If you go back to the C locale things
> will be working for you as you wish it to work.  It will work as it
> worked before.  Agreed?
>
> Then you changed something.  You changed the locale.  You in your
> environment set LANG=en_US.UTF-8 (or similar equivalent).  That is
> when you notice that sort doesn't work as you want it to work.
>

What about sorting input based on the input's locale, instead of the
system's? Sort can distinguish ASCII (iso) from UTF-8 and collate
accordingly. Even users on UTF-8 systems (like me) sometimes get ASCII
files and need to collate them correctly. For sanity purposes users could
override the locale with a command line option.

[...snip...]

>
>   The only way to fix it is to fix it at the source of the problem.  The
> source is the locale collation sequence.  Which is why I always set
> this in my environment.
>
>   export LANG=en_US.UTF-8
>   export LC_COLLATE=C
>
> But while that works for most western locales I have no idea how that
> would interact with Chinese Big5 for example.  Probably badly.  So it
> can't really be offered as a general solution to the problem.  But if
> you are using one of the set of western locales that it works for then
> it does solve the problem for you.
>

That is dangerous. I don't know how much software will stop working
correctly because of that.
Of course, one can always return locale back to its original value, but
problems might not be that obvious at all :-(

[...snip...]


> Bob
>

Information forwarded to bug-coreutils <at> gnu.org:
bug#17188; Package coreutils. (Mon, 07 Apr 2014 12:47:02 GMT)

Message #26 received at 17188 <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: Nikos Balkanas <nbalkanas <at> gmail.com>, Bob Proulx <bob <at> proulx.com>
Cc: 17188 <at> debbugs.gnu.org
Subject: Re: bug#17188: Sort bugs
Date: Mon, 07 Apr 2014 06:46:37 -0600
On 04/05/2014 02:44 PM, Nikos Balkanas wrote:

> What about sorting input based on the input's locale, instead of the
> system's?

And how do you propose to detect the input's locale?  The canonical way
to tell a program what locale the input is in is by setting the
environment variable LC_COLLATE and/or LC_ALL.

> Sort
> can distinguish ASCII (iso) from UTF-8 and collate accordingly.

ASCII is a subset of UTF-8.  There is no way to tell if input was
intended as one or the other without setting an environment variable to
make your intentions clear - but this is precisely what you already do
to get sort to do what you want.  And since this behavior is mandated by
POSIX (the behavior of LC_ALL and friends controlling how 'sort' and all
other utilities will collate, based on the definition of the chosen
locale), it is better to point people to a consistent standard that will
work across ALL implementations of 'sort', than it is to invent yet
another non-standard knob for just GNU sort.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 08 May 2014 11:24:04 GMT)

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.