GNU bug report logs - #18273
sort seems to misbehave if both -u and -n or -k are used


Package: coreutils;

Reported by: "Lennart Sorensen" <lsorense <at> csclub.uwaterloo.ca>

Date: Fri, 15 Aug 2014 19:32:02 UTC

Severity: normal

Tags: notabug

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.



Report forwarded to bug-coreutils <at> gnu.org:
bug#18273; Package coreutils. (Fri, 15 Aug 2014 19:32:02 GMT)

Acknowledgement sent to "Lennart Sorensen" <lsorense <at> csclub.uwaterloo.ca>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Fri, 15 Aug 2014 19:32:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: "Lennart Sorensen" <lsorense <at> csclub.uwaterloo.ca>
To: bug-coreutils <at> gnu.org
Cc: Len Sorensen <lsorense <at> csclub.uwaterloo.ca>
Subject: sort seems to misbehave if both -u and -n or -k are used
Date: Fri, 15 Aug 2014 15:30:11 -0400
Here is the case that has me thinking there is a bug (it sure doesn't
make sense as valid behaviour).

input:

Version: 1.0.1e-2+deb7u12
Version: 1.0.1e-2+deb7u11
Version: 1.0.1e-2+deb7u12
Version: 1.0.1e-2+deb7u7
Version: 1.0.1e-2+deb7u11

OK output using 'sort':

Version: 1.0.1e-2+deb7u11
Version: 1.0.1e-2+deb7u11
Version: 1.0.1e-2+deb7u12
Version: 1.0.1e-2+deb7u12
Version: 1.0.1e-2+deb7u7

OK output using 'sort -u':

Version: 1.0.1e-2+deb7u11
Version: 1.0.1e-2+deb7u12
Version: 1.0.1e-2+deb7u7

OK output using 'sort -n':

Version: 1.0.1e-2+deb7u11
Version: 1.0.1e-2+deb7u11
Version: 1.0.1e-2+deb7u12
Version: 1.0.1e-2+deb7u12
Version: 1.0.1e-2+deb7u7

(I may have hoped that one would sort by the last number given everything
else is equal, but I did not expect it to actually do so).

OK output using 'sort -k 3':

Version: 1.0.1e-2+deb7u11
Version: 1.0.1e-2+deb7u11
Version: 1.0.1e-2+deb7u12
Version: 1.0.1e-2+deb7u12
Version: 1.0.1e-2+deb7u7

Weird output using 'sort -n -u':

Version: 1.0.1e-2+deb7u12

Weird output using 'sort -k 3 -u':

Version: 1.0.1e-2+deb7u12

So is this actually the expected behaviour?  I would have thought from
the documentation that -u would return unique lines of output, not just
one line based on whatever sort key it happened to look at.

-- 
Len Sorensen




Added tag(s) notabug. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Fri, 15 Aug 2014 19:50:02 GMT)

Reply sent to Eric Blake <eblake <at> redhat.com>:
You have taken responsibility. (Fri, 15 Aug 2014 19:50:03 GMT)

Notification sent to "Lennart Sorensen" <lsorense <at> csclub.uwaterloo.ca>:
bug acknowledged by developer. (Fri, 15 Aug 2014 19:50:04 GMT)

Message #12 received at 18273-done <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: Lennart Sorensen <lsorense <at> csclub.uwaterloo.ca>, 18273-done <at> debbugs.gnu.org
Subject: Re: bug#18273: sort seems to misbehave if both -u and -n or -k are used
Date: Fri, 15 Aug 2014 13:48:57 -0600
tag 18273 notabug
thanks

On 08/15/2014 01:30 PM, Lennart Sorensen wrote:
> Here is the case that has me thinking there is a bug (it sure doesn't
> make sense as valid behaviour).

Thanks for the report.  However, the behavior you have demonstrated is
required by POSIX, and is therefore not a bug.  The --debug option can
be used to see what is really happening.


> 
> OK output using 'sort -n':
> 
> Version: 1.0.1e-2+deb7u11
> Version: 1.0.1e-2+deb7u11
> Version: 1.0.1e-2+deb7u12
> Version: 1.0.1e-2+deb7u12
> Version: 1.0.1e-2+deb7u7
> 
> (I may have hoped that one would sort by the last number given everything
> else is equal, but I did not expect it to actually do so).

Actually, using -n without any other hints says to treat _the entire
line_ as a number, and to quit parsing as soon as a non-numeric portion
is found.  Observe:

$ LC_ALL=C sort foo --debug -n
sort: using simple byte comparison
Version: 1.0.1e-2+deb7u11
^ no match for key
_________________________
Version: 1.0.1e-2+deb7u11
^ no match for key
_________________________
Version: 1.0.1e-2+deb7u12
^ no match for key
_________________________
Version: 1.0.1e-2+deb7u12
^ no match for key
_________________________
Version: 1.0.1e-2+deb7u7
^ no match for key
________________________
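
A smaller illustration of the same prefix parsing, as a sketch only: the three input lines below are made up for this example and do not come from the report. With -n, only the leading digits of each line form the key, and a line with no leading number has an empty key that compares as 0, so "kiwi" should sort ahead of "9 pears" and "10 apples".

```shell
# Hypothetical sample input, not taken from the report.
# Keys seen by -n: "10 apples" -> 10, "9 pears" -> 9, "kiwi" -> empty (0).
numeric_sorted=$(printf '10 apples\n9 pears\nkiwi\n' | LC_ALL=C sort -n)

# Print the result: kiwi first (key 0), then 9 pears, then 10 apples.
printf '%s\n' "$numeric_sorted"
```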

Furthermore, if you disable the last-resort comparison of the entire
line, then you get the input order, since all of your keys were
identically the empty numeric string at the front of the line:

$ LC_ALL=C sort foo --debug -n -s
sort: using simple byte comparison
Version: 1.0.1e-2+deb7u12
^ no match for key
Version: 1.0.1e-2+deb7u11
^ no match for key
Version: 1.0.1e-2+deb7u12
^ no match for key
Version: 1.0.1e-2+deb7u7
^ no match for key
Version: 1.0.1e-2+deb7u11
^ no match for key

> 
> OK output using 'sort -k 3':
> 
> Version: 1.0.1e-2+deb7u11
> Version: 1.0.1e-2+deb7u11
> Version: 1.0.1e-2+deb7u12
> Version: 1.0.1e-2+deb7u12
> Version: 1.0.1e-2+deb7u7

Umm, here, you don't HAVE a key 3.  Again, as soon as you disable
last-resort comparison, you get the original input order:

$ LC_ALL=C sort foo --debug -k3 -s
sort: using simple byte comparison
Version: 1.0.1e-2+deb7u12
                         ^ no match for key
Version: 1.0.1e-2+deb7u11
                         ^ no match for key
Version: 1.0.1e-2+deb7u12
                         ^ no match for key
Version: 1.0.1e-2+deb7u7
                        ^ no match for key
Version: 1.0.1e-2+deb7u11
                         ^ no match for key
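
The effect of -s on a missing key can also be reproduced without --debug. The following is a minimal sketch using a made-up two-line input (an assumption for illustration, not data from the report): neither line has a third field, so every -k3 key is empty and only the last-resort comparison can reorder the lines.

```shell
# Hypothetical two-field input: no line has a third field, so -k3 yields
# an empty key for every line and all keys compare equal.
input='z 1
a 2'

# Without -s, the last-resort whole-line comparison breaks the tie.
with_last_resort=$(printf '%s\n' "$input" | LC_ALL=C sort -k3)

# With -s, lines whose keys tie are left in their input order.
stable_order=$(printf '%s\n' "$input" | LC_ALL=C sort -s -k3)

printf 'default:\n%s\nstable:\n%s\n' "$with_last_resort" "$stable_order"
```

The default run lists "a 2" before "z 1", while the stable run keeps "z 1" first, exactly as in the input.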

> 
> Weird output using 'sort -n -u':
> 
> Version: 1.0.1e-2+deb7u12

No, perfectly defined output.  -u implicitly enables -s, and I already
demonstrated that -n on your input picks the initial empty string.
Since all 5 lines have the same sort key, there is only one unique key
seen, and the output is exactly the first line with that unique sort
key.  If you want to FORCE entire-line fallback, then request that as a
fallback key (since -n by itself is global to all keys, I instead
request two keys: the first as the numeric sort of the first field, the
second as the fallback sort of the entire line):

$ LC_ALL=C sort foo --debug -k1,1n -k1 -u
sort: using simple byte comparison
Version: 1.0.1e-2+deb7u11
^ no match for key
_________________________
Version: 1.0.1e-2+deb7u12
^ no match for key
_________________________
Version: 1.0.1e-2+deb7u7
^ no match for key
________________________
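
The same two-key approach on a tiny made-up input (an assumption for illustration, not from the report) shows the explicit whole-line key restoring line-level uniqueness. This is only a sketch of the technique described above: the numeric key is empty on every line, so the second key, -k1, decides both the order and which lines count as duplicates.

```shell
# Hypothetical input with no leading numbers, so the -k1,1n key is empty
# (treated as 0) on every line; the second key, -k1, spans the whole line.
deduped=$(printf 'b\na\nb\n' | LC_ALL=C sort -u -k1,1n -k1)

# Only the duplicate "b" is dropped; "a" and "b" both survive.
printf '%s\n' "$deduped"
```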


> 
> Weird output using 'sort -k 3 -u':
> 
> Version: 1.0.1e-2+deb7u12

Again, as proven above, all 5 lines have the same empty string (no such
key at the end of the line), so the unique output is correct.

> 
> So is this actually the expected behaviour?  I would have thought from
> the documentation that -u would return unique lines of output, not just
> one line based on whatever sort key it happened to look at.

Yes, sort -u is required to treat lines as unique solely based on the
key(s) they were sorted by (and ignoring the default last-resort key,
since -u implicitly enables -s).

As this behavior is required by POSIX and consistent with other
implementations, I'm closing it as not a bug.  But if you have further
comments or questions, you can continue to reply to this email.

By the way, have you looked at sort -V, as a way to get what you appear
to want?

$ LC_ALL=C sort foo --debug -V -u
sort: using simple byte comparison
Version: 1.0.1e-2+deb7u7
________________________
Version: 1.0.1e-2+deb7u11
_________________________
Version: 1.0.1e-2+deb7u12
_________________________
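
For anyone who wants to reproduce this without --debug, here is a self-contained sketch that feeds the five lines from the original report to sort -V -u through a shell variable instead of the file foo. The variable names are chosen for this example only; -V is a GNU extension and may not exist in other sort implementations.

```shell
# Input lines copied from the original report.
versions='Version: 1.0.1e-2+deb7u12
Version: 1.0.1e-2+deb7u11
Version: 1.0.1e-2+deb7u12
Version: 1.0.1e-2+deb7u7
Version: 1.0.1e-2+deb7u11'

# Version sort compares embedded digit runs numerically, so u7 < u11 < u12;
# -u then drops the repeated u11 and u12 lines.
version_sorted=$(printf '%s\n' "$versions" | LC_ALL=C sort -V -u)

printf '%s\n' "$version_sorted"
```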

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Information forwarded to bug-coreutils <at> gnu.org:
bug#18273; Package coreutils. (Fri, 15 Aug 2014 20:23:02 GMT)

Message #15 received at 18273 <at> debbugs.gnu.org:

From: "Lennart Sorensen" <lsorense <at> csclub.uwaterloo.ca>
To: 18273 <at> debbugs.gnu.org
Subject: Re: bug#18273: closed (Re: bug#18273: sort seems to misbehave if both -u and -n or -k are used)
Date: Fri, 15 Aug 2014 16:22:07 -0400
On Fri, Aug 15, 2014 at 07:50:04PM +0000, GNU bug Tracking System wrote:
> Your bug report
> 
> #18273: sort seems to misbehave if both -u and -n or -k are used
> 
> which was filed against the coreutils package, has been closed.
> 
> The explanation is attached below, along with your original report.
> If you require more details, please reply to 18273 <at> debbugs.gnu.org.
> 
> -- 
> 18273: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=18273
> GNU Bug Tracking System
> Contact help-debbugs <at> gnu.org with problems

> From: Eric Blake <eblake <at> redhat.com>
> To: Lennart Sorensen <lsorense <at> csclub.uwaterloo.ca>,
>  18273-done <at> debbugs.gnu.org
> Subject: Re: bug#18273: sort seems to misbehave if both -u and -n or -k are
>  used
> 
> tag 18273 notabug
> thanks
> 
> On 08/15/2014 01:30 PM, Lennart Sorensen wrote:
> > Here is the case that has me thinking there is a bug (it sure doesn't
> > make sense as valid behaviour).
> 
> Thanks for the report.  However, the behavior you have demonstrated is
> required by POSIX, and is therefore not a bug.  The --debug option can
> be used to see what is really happening.

OK I accept that it is correct behaviour.

The documentation on the other hand is awful in that case.  I went and
checked the documentation to try and make sense of what it was doing
before sending the report, and there was nothing there that gave any
hint that this was expected behaviour.

Why does it have a blob talking about which options implicitly enable -s,
rather than mention that in the documentation for the options that do it?

Why does it not mention for -n that anything that isn't a number is
ignored and treated as if it didn't exist when it comes to deciding
things like uniqueness?  Are people expected to go read the POSIX
standard instead?

-- 
Len Sorensen




Information forwarded to bug-coreutils <at> gnu.org:
bug#18273; Package coreutils. (Fri, 15 Aug 2014 20:33:02 GMT)

Message #18 received at 18273 <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: Lennart Sorensen <lsorense <at> csclub.uwaterloo.ca>, 18273 <at> debbugs.gnu.org
Subject: Re: bug#18273: closed (Re: bug#18273: sort seems to misbehave if both -u and -n or -k are used)
Date: Fri, 15 Aug 2014 14:32:14 -0600
On 08/15/2014 02:22 PM, Lennart Sorensen wrote:

> OK I accept that it is correct behaviour.
> 
> The documentation on the other hand is awful in that case.  I went and
> checked the documentation to try and make sense of what it was doing
> before sending the report, and there was nothing there that gave any
> hint that this was expected behaviour.

'info sort' says:

  The '--stable' ('-s') option
disables this "last-resort comparison" so that lines in which all fields
compare equal are left in their original relative order.  The '--unique'
('-u') option also disables the last-resort comparison.

and later on:

'-u'
'--unique'

     Normally, output only the first of a sequence of lines that compare
     equal.  For the '--check' ('-c' or '-C') option, check that no pair
     of consecutive lines compares equal.

     This option also disables the default last-resort comparison.

     The commands 'sort -u' and 'sort | uniq' are equivalent, but this
     equivalence does not extend to arbitrary 'sort' options.  For
     example, 'sort -n -u' inspects only the value of the initial
     numeric string when checking for uniqueness, whereas 'sort -n |
     uniq' inspects the entire line.  *Note uniq invocation::.
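
The difference that excerpt describes can be seen side by side. This is a minimal sketch on a made-up two-line input (an assumption for illustration): neither line starts with a number, so both have the same empty numeric key.

```shell
# Hypothetical input: neither line starts with a number, so both have
# the same (empty) numeric key.
input='pear
apple'

# sort -n -u judges uniqueness by the key alone: one line survives,
# and it is the first input line carrying that key.
by_key=$(printf '%s\n' "$input" | LC_ALL=C sort -n -u)

# sort -n | uniq compares whole lines, so both lines survive.
by_line=$(printf '%s\n' "$input" | LC_ALL=C sort -n | uniq)

printf 'sort -n -u:\n%s\nsort -n | uniq:\n%s\n' "$by_key" "$by_line"
```

Piping through uniq is therefore one way to keep line-level uniqueness while still ordering by a numeric key.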


> 
> Why does it have a blob talking about which options implicitly enable -s,
> rather than mention that in the documentation for the options that do it?

-u is the only option that implicitly enables -s.

You are welcome to propose a patch to the documentation that would
clarify the situation; we can reopen this bug if a patch materializes.
Maybe even a change to 'sort --help' output to mention that -u implies
-s (which would also feed the 'man sort' page).

> 
> Why does it not mention for -n that anything that isn't a number is
> ignored and treated as if it didn't exist when it comes to deciding
> things like uniqueness?  Are people expected to go read the POSIX
> standard instead?

The info page DOES mention this:

'-n'
'--numeric-sort'
'--sort=numeric'
     Sort numerically.  The number begins each line and consists of
     optional blanks, an optional '-' sign, and zero or more digits
     possibly separated by thousands separators, optionally followed by
     a decimal-point character and zero or more digits.  An empty number
     is treated as '0'.  The 'LC_NUMERIC' locale specifies the
     decimal-point character and thousands separator.  By default a
     blank is a space or a tab, but the 'LC_CTYPE' locale can change
     this.
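
To see the "empty number is treated as '0'" rule in action, here is a short sketch with made-up input (an assumption for illustration). The line "abc" has no leading digits, so it ties with the literal "0"; -s is used so the tie is left in input order rather than being broken by the last-resort comparison.

```shell
# Hypothetical input mixing real numbers with a non-numeric line.
# Keys: "1" -> 1, "abc" -> empty (0), "0" -> 0, "-1" -> -1.
zero_sorted=$(printf '1\nabc\n0\n-1\n' | LC_ALL=C sort -n -s)

# "abc" and "0" compare equal, so -s keeps "abc" ahead of "0",
# the order in which they appeared in the input.
printf '%s\n' "$zero_sorted"
```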

The --help output is intentionally terse, so I don't know what we could
do there to make it more obvious without exploding the size of what is
supposed to be brief.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


Information forwarded to bug-coreutils <at> gnu.org:
bug#18273; Package coreutils. (Fri, 15 Aug 2014 21:06:01 GMT)

Message #21 received at 18273 <at> debbugs.gnu.org:

From: "Lennart Sorensen" <lsorense <at> csclub.uwaterloo.ca>
To: Eric Blake <eblake <at> redhat.com>
Cc: 18273 <at> debbugs.gnu.org
Subject: Re: bug#18273: closed (Re: bug#18273: sort seems to misbehave if both -u and -n or -k are used)
Date: Fri, 15 Aug 2014 17:05:23 -0400
On Fri, Aug 15, 2014 at 02:32:14PM -0600, Eric Blake wrote:
> 'info sort' says:
> 
>   The '--stable' ('-s') option
> disables this "last-resort comparison" so that lines in which all fields
> compare equal are left in their original relative order.  The '--unique'
> ('-u') option also disables the last-resort comparison.
> 
> and later on:
> 
> '-u'
> '--unique'
> 
>      Normally, output only the first of a sequence of lines that compare
>      equal.  For the '--check' ('-c' or '-C') option, check that no pair
>      of consecutive lines compares equal.
> 
>      This option also disables the default last-resort comparison.
> 
>      The commands 'sort -u' and 'sort | uniq' are equivalent, but this
>      equivalence does not extend to arbitrary 'sort' options.  For
>      example, 'sort -n -u' inspects only the value of the initial
>      numeric string when checking for uniqueness, whereas 'sort -n |
>      uniq' inspects the entire line.  *Note uniq invocation::.

OK I guess that does somewhat point out the behaviour.

> -u is the only option that implicitly enables -s.
> 
> You are welcome to propose a patch to the documentation that would
> clarify the situation; we can reopen this bug if a patch materializes.
> Maybe even a change to 'sort --help' output to mention that -u implies
> -s (which would also feed the 'man sort' page).

I do wonder why there isn't an option to undo that implicit option,
but perhaps it would not actually make sense.

> The info page DOES mention this:
> 
> '-n'
> '--numeric-sort'
> '--sort=numeric'
>      Sort numerically.  The number begins each line and consists of
>      optional blanks, an optional '-' sign, and zero or more digits
>      possibly separated by thousands separators, optionally followed by
>      a decimal-point character and zero or more digits.  An empty number
>      is treated as '0'.  The 'LC_NUMERIC' locale specifies the
>      decimal-point character and thousands separator.  By default a
>      blank is a space or a tab, but the 'LC_CTYPE' locale can change
>      this.
> 
> The --help output is intentionally terse, so I don't know what we could
> do there to make it more obvious without exploding the size of what is
> supposed to be brief.

Well I always thought info was meant to be complete documentation.

I see nothing in the above that makes me think it would ignore the part
of the line that isn't a number.  The part in -u does seem to point out
that this is the behaviour.

I think this might be the first time I ever used -n when the input was
not pure numbers, so I never hit this before.

-- 
Len Sorensen




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 13 Sep 2014 11:24:03 GMT)
