GNU bug report logs - #9995
problem about sort -u -k


Package: coreutils;

Reported by: 夏凯 <walkerxk <at> gmail.com>

Date: Tue, 8 Nov 2011 17:25:15 UTC

Severity: normal

Tags: notabug

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 9995 in the body.
You can then email your comments to 9995 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-coreutils <at> gnu.org:
bug#9995; Package coreutils. (Tue, 08 Nov 2011 17:25:15 GMT)

Acknowledgement sent to 夏凯 <walkerxk <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Tue, 08 Nov 2011 17:25:16 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: 夏凯 <walkerxk <at> gmail.com>
To: bug-coreutils <at> gnu.org
Subject: problem about sort -u -k
Date: Tue, 8 Nov 2011 22:49:12 +0800
When I use the sort command with -k and -n together, I get the wrong result:
22:41:21#tp#~> LC_ALL=C
22:41:39#tp#~> /usr/local/bin/sort -u -k1,3 a
1 a q
1 a w
3 a w
22:41:48#tp#~> /usr/local/bin/sort -u -k3 a
1 a q
1 a w
22:41:49#tp#~> cat a
1 a q
1 a w
3 a w
22:41:52#tp#~> /usr/local/bin/sort --version
sort (GNU coreutils) 8.14
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and Paul Eggert.
22:41:57#tp#~>
Why is that?
I read http://www.gnu.org/s/coreutils/manual/html_node/sort-invocation.html,
but found nothing about this.
Any help is appreciated.
-- 
contact me:
MSN: walkerxk <at> gmail.com
GTALK: walkerxk <at> gmail.com




Added tag(s) notabug. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Tue, 08 Nov 2011 18:55:02 GMT)

Reply sent to Eric Blake <eblake <at> redhat.com>:
You have taken responsibility. (Tue, 08 Nov 2011 18:55:04 GMT)

Notification sent to 夏凯 <walkerxk <at> gmail.com>:
bug acknowledged by developer. (Tue, 08 Nov 2011 18:55:04 GMT)

Message #12 received at 9995-done <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: 夏凯 <walkerxk <at> gmail.com>
Cc: 9995-done <at> debbugs.gnu.org
Subject: Re: bug#9995: problem about sort -u -k
Date: Tue, 08 Nov 2011 11:54:17 -0700
tag 9995 notabug
thanks

On 11/08/2011 07:49 AM, 夏凯 wrote:
> When I use the sort command with -k and -n together, I get the wrong result:

Thanks for the report; however, this is most likely not a bug in sort, 
but in your usage patterns.  Your sentence mentioned -k and -n together, 
but your example and subject line mentioned -u and -k together; so I'll 
assume that you got surprised by -u, not -n.

> 22:41:21#tp#~>  LC_ALL=C

Unless you also did 'export LC_ALL' at some point, this does not 
guarantee that child processes will see this setting in their environment.
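
For illustration, here is a minimal sketch of the difference between a plain assignment and an exported one. It assumes a POSIX shell; the child "sh -c" is only a stand-in for any child process such as sort, and the variable names before and after are made up for this example.

```shell
# Start from a clean slate so an inherited LC_ALL does not hide the effect.
unset LC_ALL

LC_ALL=C                                   # shell variable only, not exported
before=$(sh -c 'echo "${LC_ALL-unset}"')   # what a child process sees

export LC_ALL                              # now part of the environment
after=$(sh -c 'echo "${LC_ALL-unset}"')

echo "before export: $before"
echo "after export:  $after"
```

Prefixing a single command, as in "LC_ALL=C sort ...", is another way to make sure that one child sees the setting.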

> 22:41:39#tp#~>  /usr/local/bin/sort -u -k1,3 a
> 1 a q
> 1 a w
> 3 a w
> 22:41:48#tp#~>  /usr/local/bin/sort -u -k3 a
> 1 a q
> 1 a w
> 22:41:49#tp#~>  cat a
> 1 a q
> 1 a w
> 3 a w
> 22:41:52#tp#~>  /usr/local/bin/sort --version
> sort (GNU coreutils) 8.14

That's new enough that you can use the --debug option to see what was 
really going on:

$ LC_ALL=C ../coreutils/src/sort --debug -u -k1,3 a
sort: using simple byte comparison
1 a q
_____
1 a w
_____
3 a w
_____

Here, the key spans fields 1 through 3, which is the entire line for this input, so all three lines compare as distinct and none is dropped.

$ LC_ALL=C ../coreutils/src/sort --debug -u -k3 a
sort: using simple byte comparison
1 a q
   __
1 a w
   __

Here, you told sort to only look at a key of field three onwards, and to 
uniquify the results (that is, don't display multiple lines if they had 
the same sort key).  Since two lines both have the string " w" as the 
-k3 key, sort -u picked one of those lines (namely "3 a w") to be 
discarded on output.  This behavior matches POSIX rules.
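
The two cases above can be reproduced without a file, as a rough sketch. The helper name input and the variable names are hypothetical; the data is the reporter's three-line file fed on stdin.

```shell
# The reporter's three-line file, supplied on stdin instead of via file "a".
input() { printf '1 a q\n1 a w\n3 a w\n'; }

# Key = fields 1 through 3, the whole line here: all three lines differ.
whole=$(input | LC_ALL=C sort -u -k1,3)

# Key = field 3 to end of line: " w" occurs twice, so one line is dropped.
tail_key=$(input | LC_ALL=C sort -u -k3)

whole_count=$(printf '%s\n' "$whole" | wc -l)
tail_count=$(printf '%s\n' "$tail_key" | wc -l)

echo "-k1,3 keeps $whole_count lines; -k3 keeps $tail_count lines"
```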

Since you didn't tell us what output you were hoping to get, I can't 
tell you the proper command line that would match your expected output.
Feel free to reply, even while this bug is closed, if you need more
help in getting the output you want.  Also, if you can prove that sort 
is doing something wrong, then feel free to reopen this bug with more 
evidence of why it is a bug in sort, including --debug output to back up 
your claim (but be aware that more than 90% of "bug" reports against 
sort have been debunked as user error rather than an actual bug in sort).

-- 
Eric Blake   eblake <at> redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org




Information forwarded to bug-coreutils <at> gnu.org:
bug#9995; Package coreutils. (Tue, 08 Nov 2011 19:46:02 GMT)

Message #15 received at 9995 <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: 9995 <at> debbugs.gnu.org, 夏凯 <walkerxk <at> gmail.com>
Subject: Re: bug#9995: problem about sort -u -k
Date: Tue, 08 Nov 2011 12:45:11 -0700
On 11/08/2011 11:54 AM, Eric Blake wrote:
>> 22:41:39#tp#~> /usr/local/bin/sort -u -k1,3 a
>> 1 a q
>> 1 a w
>> 3 a w
>> 22:41:48#tp#~> /usr/local/bin/sort -u -k3 a
>> 1 a q
>> 1 a w

> Since you didn't tell us what output you were hoping to get, I can't
> tell you the proper command line that would match your expected output.
> Feel free to reply, even while this bug is closed, if you need more help
> in getting the output you want.

I'll give a preemptive attempt at guessing what you meant, as well:

If you wanted to sort on just the third and subsequent fields, but then 
strip duplicate lines only if the entire line is duplicate, then you 
have to use two processes:

sort [-s] -k3 a | uniq

If you don't mind a two-key sort, where the primary key is the third and 
subsequent fields, but where the secondary key is the entire line so as 
to force sort -u to consider the entire line when determining 
uniqueness, then one process will do:

sort -u -k3 -k1 a

To see the difference, and remembering that sort -u implies sort -s, 
consider these contents for a:

$ cat a
1 a q
2 a q
1 a q
1 a w
3 a w
$ sort -u -k3 -k1 a
1 a q
2 a q
1 a w
3 a w
$ sort -s -k3 a | uniq
1 a q
2 a q
1 a q
1 a w
3 a w
$ sort -k3 a | uniq
1 a q
2 a q
1 a w
3 a w

That is, if the stable sort of just -k3 leaves identical lines that are 
not adjacent ("1 a q" in my example), then the separate uniq process 
won't filter them; while using sort -u with -k1 as the means to force 
the entire line as a secondary sort key loses the ability to leave 
identical lines separated by a distinct line.  Likewise, omitting both 
-s and -u lets sort imply a last-resort -k1, at which point uniq sees 
the same line order as sort -u sees.

>> i read http://www.gnu.org/s/coreutils/manual/html_node/sort-invocation.html,
>> but got nothing about this.

Actually, it does - under the option -u, I see:

The commands sort -u and sort | uniq are equivalent, but this 
equivalence does not extend to arbitrary sort options. For example, sort 
-n -u inspects only the value of the initial numeric string when 
checking for uniqueness, whereas sort -n | uniq inspects the entire 
line. See uniq invocation.
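
A small sketch of that manual example, using made-up three-line input (the helper name input and the variable names are hypothetical):

```shell
# Two lines share the numeric value 1 but differ in the rest of the line.
input() { printf '1 a\n1 b\n2 c\n'; }

# -n -u: uniqueness is judged on the numeric key alone, so "1 a" and "1 b" collide.
sort_u=$(input | LC_ALL=C sort -n -u)

# -n | uniq: uniq compares whole lines, so all three lines survive.
sort_uniq=$(input | LC_ALL=C sort -n | uniq)

sort_u_count=$(printf '%s\n' "$sort_u" | wc -l)
sort_uniq_count=$(printf '%s\n' "$sort_uniq" | wc -l)

echo "sort -n -u:      $sort_u_count lines"
echo "sort -n | uniq:  $sort_uniq_count lines"
```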





Information forwarded to bug-coreutils <at> gnu.org:
bug#9995; Package coreutils. (Wed, 09 Nov 2011 14:04:02 GMT)

Message #18 received at 9995 <at> debbugs.gnu.org:

From: 夏凯 <walkerxk <at> gmail.com>
To: 9995 <at> debbugs.gnu.org
Subject: Re: bug#9995: problem about sort -u -k
Date: Wed, 9 Nov 2011 22:02:26 +0800
Thanks for your reply.
If I want to use the entire line as a key and sort by the third
field, should I use sort -u -k3 -k1 -k2 a to do that?



-- 
contact me:
MSN: walkerxk <at> gmail.com
GTALK: walkerxk <at> gmail.com




Information forwarded to bug-coreutils <at> gnu.org:
bug#9995; Package coreutils. (Wed, 09 Nov 2011 14:59:01 GMT)

Message #21 received at 9995 <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: 夏凯 <walkerxk <at> gmail.com>, 9995 <at> debbugs.gnu.org
Subject: Re: bug#9995: problem about sort -u -k
Date: Wed, 09 Nov 2011 07:58:42 -0700
[Let's keep the list in the loop]

On 11/08/2011 07:58 PM, 夏凯 wrote:
> thanks for you reply.
> if i want to get my result, whether should i use sort -u -k3 -k1 -k2 a
> to do that?
>

I'm still not quite sure what result you want.

sort -u -k3 -k1 -k2 a

says to sort with three keys - from field 3 to the end of the line, from 
field 1 to the end of the line (aka the entire line), and from field 2 
to the end of the line (that -k2 is useless: once the -k1 key, which
spans the entire line, compares equal, the lines are identical, so
nothing from field 2 onward can still distinguish them).  Then,
after sorting, sort discards any lines where all three keys are 
identical, and since the -k1 key was the entire line, you are discarding 
only duplicate lines.  But I don't know if that is what you wanted.
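
As a quick check of the claim that the trailing -k2 changes nothing, here is a sketch using the five-line sample file from my earlier message, fed on stdin. The helper name input and the variable names are hypothetical.

```shell
# Five-line sample from earlier in this thread, including one duplicate line.
input() { printf '1 a q\n2 a q\n1 a q\n1 a w\n3 a w\n'; }

two_keys=$(input | LC_ALL=C sort -u -k3 -k1)
three_keys=$(input | LC_ALL=C sort -u -k3 -k1 -k2)

# -k1 already covers the whole line, so the extra -k2 can never break a tie.
if [ "$two_keys" = "$three_keys" ]; then
    same=yes
else
    same=no
fi

echo "outputs identical: $same"
printf '%s\n' "$three_keys"
```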





Information forwarded to bug-coreutils <at> gnu.org:
bug#9995; Package coreutils. (Thu, 10 Nov 2011 03:10:02 GMT)

Message #24 received at 9995 <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: 9995 <at> debbugs.gnu.org
Subject: Re: bug#9995: problem about sort -u -k
Date: Wed, 09 Nov 2011 20:08:45 -0700
[top-posting on technical lists is generally frowned on]
[re-adding the list - it's always wiser to keep the list in the loop]

On 11/09/2011 07:25 PM, 夏凯 wrote:
> actually, i just want the result of sort -sk3 a|uniq, we can't just
> use -u to instead of uniq?

Nope, and I already explained why and gave a sample file to demonstrate 
it.  These two are equivalent:

sort -k3 a | uniq
sort -u -k3 -k1 a

but there is no way to get both stable sorting that leaves fields 1 and 
2 unsorted and in the original order, as well as stripping adjacent 
duplicate lines, without also involving a separate uniq process.  That 
is, there is no one-process counterpart to:

sort -s -k3 a | uniq

The reason is that the only way to match uniq behavior is to have the 
sort key cover the entire line, but the moment you add -k1 to cover the 
entire line, your sort is no longer stable on your original sort of just 
-k3.
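
To make the comparison concrete, a sketch along these lines (again with the five-line sample on stdin; the helper and variable names are hypothetical) shows the equivalent pair agreeing and the stable pipeline differing:

```shell
# Five-line sample from earlier in this thread; "1 a q" appears twice.
input() { printf '1 a q\n2 a q\n1 a q\n1 a w\n3 a w\n'; }

plain_uniq=$(input | LC_ALL=C sort -k3 | uniq)       # last-resort whole-line compare
one_process=$(input | LC_ALL=C sort -u -k3 -k1)      # whole line as secondary key
stable_uniq=$(input | LC_ALL=C sort -s -k3 | uniq)   # input order kept within equal keys

[ "$plain_uniq" = "$one_process" ] && pair=equal || pair=different
[ "$stable_uniq" = "$one_process" ] && stable=equal || stable=different

echo "sort -k3 | uniq     vs sort -u -k3 -k1: $pair"
echo "sort -s -k3 | uniq  vs sort -u -k3 -k1: $stable"
```

In the stable case the two copies of "1 a q" stay separated by "2 a q", so uniq has nothing adjacent to collapse.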

Also, you may want to consider whether -k3 is what you really meant, or 
if you want to use -k3,3 (that is, whether sorting by the entire line 
except for the first two fields, or sorting by just the third field 
while ignoring any fourth or later field).  Note that I intentionally 
used -k1 as shorthand for the entire line.
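
The -k3 versus -k3,3 distinction only shows up once there is a fourth field. A rough sketch with made-up four-field input (not from the reporter's file; the helper and variable names are hypothetical):

```shell
# Hypothetical input with a fourth field so the two key specs can differ.
input() { printf '1 a w z\n2 a w y\n3 a q x\n'; }

# -k3: key runs from field 3 to end of line, so field 4 takes part in ordering.
to_end=$(input | LC_ALL=C sort -s -k3)

# -k3,3: key is field 3 only; lines tied on " w" keep their input order (-s).
third_only=$(input | LC_ALL=C sort -s -k3,3)

echo "sort -s -k3:"
printf '%s\n' "$to_end"
echo "sort -s -k3,3:"
printf '%s\n' "$third_only"
```

With -k3 the "2 a w y" line sorts ahead of "1 a w z" because " w y" compares less than " w z"; with -k3,3 both have the key " w" and stay in input order.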





bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 08 Dec 2011 12:24:03 GMT)
