GNU bug report logs - #9995: problem about sort -u -k
Reported by: 夏凯 <walkerxk <at> gmail.com>
Date: Tue, 8 Nov 2011 17:25:15 UTC
Severity: normal
Tags: notabug
Done: Eric Blake <eblake <at> redhat.com>
Bug is archived. No further changes may be made.
Report forwarded to bug-coreutils <at> gnu.org: bug#9995; Package coreutils. (Tue, 08 Nov 2011 17:25:15 GMT)
Acknowledgement sent to 夏凯 <walkerxk <at> gmail.com>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Tue, 08 Nov 2011 17:25:16 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
When I use the sort command with -k and -n together, I get the wrong result:
22:41:21#tp#~> LC_ALL=C
22:41:39#tp#~> /usr/local/bin/sort -u -k1,3 a
1 a q
1 a w
3 a w
22:41:48#tp#~> /usr/local/bin/sort -u -k3 a
1 a q
1 a w
22:41:49#tp#~> cat a
1 a q
1 a w
3 a w
22:41:52#tp#~> /usr/local/bin/sort --version
sort (GNU coreutils) 8.14
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Mike Haertel and Paul Eggert.
22:41:57#tp#~>
Why is that?
I read http://www.gnu.org/s/coreutils/manual/html_node/sort-invocation.html,
but found nothing about this.
Any help is appreciated.
--
contact me:
MSN: walkerxk <at> gmail.com
GTALK: walkerxk <at> gmail.com
Added tag(s) notabug. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Tue, 08 Nov 2011 18:55:02 GMT)
Reply sent to Eric Blake <eblake <at> redhat.com>: You have taken responsibility. (Tue, 08 Nov 2011 18:55:04 GMT)
Notification sent to 夏凯 <walkerxk <at> gmail.com>: bug acknowledged by developer. (Tue, 08 Nov 2011 18:55:04 GMT)
Message #12 received at 9995-done <at> debbugs.gnu.org (full text, mbox):
tag 9995 notabug
thanks
On 11/08/2011 07:49 AM, 夏凯 wrote:
> When I use the sort command with -k and -n together, I get the wrong result:
Thanks for the report; however, this is most likely not a bug in sort,
but in your usage patterns. Your sentence mentioned -k and -n together,
but your example and subject line mentioned -u and -k together; so I'll
assume that you got surprised by -u, not -n.
> 22:41:21#tp#~> LC_ALL=C
Unless you also did 'export LC_ALL' at some point, this does not
guarantee that child processes will see this setting in their environment.
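The point about 'export LC_ALL' can be checked with a small script. This is only a sketch: the variable names plain, exported and prefixed are made up for illustration, and it assumes a POSIX-style sh where a plain assignment does not export the variable.

```shell
#!/bin/sh
# Start clean so the result does not depend on the caller's environment.
unset LC_ALL

# 1. Plain assignment: sets a shell variable, but does not export it.
LC_ALL=C
plain=$(sh -c 'echo "${LC_ALL-unset}"')

# 2. After export, child processes inherit the setting.
export LC_ALL
exported=$(sh -c 'echo "${LC_ALL-unset}"')

# 3. A per-command prefix affects only that one command.
unset LC_ALL
prefixed=$(LC_ALL=C sh -c 'echo "${LC_ALL-unset}"')

echo "plain=$plain exported=$exported prefixed=$prefixed"
```

The per-command prefix (as in 'LC_ALL=C sort ...') is usually the least surprising form, since it leaves the rest of the session untouched.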
> 22:41:39#tp#~> /usr/local/bin/sort -u -k1,3 a
> 1 a q
> 1 a w
> 3 a w
> 22:41:48#tp#~> /usr/local/bin/sort -u -k3 a
> 1 a q
> 1 a w
> 22:41:49#tp#~> cat a
> 1 a q
> 1 a w
> 3 a w
> 22:41:52#tp#~> /usr/local/bin/sort --version
> sort (GNU coreutils) 8.14
That's new enough that you can use the --debug option to see what was
really going on:
$ LC_ALL=C ../coreutils/src/sort --debug -u -k1,3 a
sort: using simple byte comparison
1 a q
_____
1 a w
_____
3 a w
_____
Here, you compared all three lines, which were all distinct.
$ LC_ALL=C ../coreutils/src/sort --debug -u -k3 a
sort: using simple byte comparison
1 a q
   __
1 a w
   __
Here, you told sort to only look at a key of field three onwards, and to
uniquify the results (that is, don't display multiple lines if they had
the same sort key). Since two lines both have the string " w" as the
-k3 key, sort -u picked one of those lines (namely "3 a w") to be
discarded on output. This behavior matches POSIX rules.
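For anyone who wants to reproduce the two cases without --debug, here is a minimal sketch. It assumes GNU sort and mktemp are available; the temporary file and the variable names full and tail_key are only for illustration.

```shell
#!/bin/sh
# Recreate the three-line input file from the report in a temporary location.
tmp=$(mktemp)
printf '1 a q\n1 a w\n3 a w\n' > "$tmp"

# Key = fields 1 through 3 (the whole line here): all three lines are distinct.
full=$(LC_ALL=C sort -u -k1,3 "$tmp")

# Key = field 3 to end of line: " w" occurs twice, so one of those lines is dropped.
tail_key=$(LC_ALL=C sort -u -k3 "$tmp")

printf '%s\n---\n%s\n' "$full" "$tail_key"
rm -f "$tmp"
```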
Since you didn't tell us what output you were hoping to get, I can't
tell you the proper command line that would match your expected output.
Feel free to reply, even while this bug is closed, if you need more
help in getting the output you want. Also, if you can prove that sort
is doing something wrong, then feel free to reopen this bug with more
evidence of why it is a bug in sort, including --debug output to back up
your claim (but be aware that more than 90% of "bug" reports against
sort have been debunked as user error rather than an actual bug in sort).
--
Eric Blake eblake <at> redhat.com +1-801-349-2682
Libvirt virtualization library http://libvirt.org
Information forwarded to bug-coreutils <at> gnu.org: bug#9995; Package coreutils. (Tue, 08 Nov 2011 19:46:02 GMT)
Message #15 received at 9995 <at> debbugs.gnu.org (full text, mbox):
On 11/08/2011 11:54 AM, Eric Blake wrote:
>> 22:41:39#tp#~> /usr/local/bin/sort -u -k1,3 a
>> 1 a q
>> 1 a w
>> 3 a w
>> 22:41:48#tp#~> /usr/local/bin/sort -u -k3 a
>> 1 a q
>> 1 a w
> Since you didn't tell us what output you were hoping to get, I can't
> tell you the proper command line that would match your expected output.
> Feel free to reply, even while this bug is closed, if you need more help
> in getting the output you want.
I'll give a preemptive attempt at guessing what you meant, as well:
If you wanted to sort on just the third and subsequent fields, but then
strip duplicate lines only if the entire line is duplicate, then you
have to use two processes:
sort [-s] -k3 a | uniq
If you don't mind a two-key sort, where the primary key is the third and
subsequent fields, but where the secondary key is the entire line so as
to force sort -u to consider the entire line when determining
uniqueness, then one process will do:
sort -u -k3 -k1 a
To see the difference, and remembering that sort -u implies sort -s,
consider these contents for a:
$ cat a
1 a q
2 a q
1 a q
1 a w
3 a w
$ sort -u -k3 -k1 a
1 a q
2 a q
1 a w
3 a w
$ sort -s -k3 a | uniq
1 a q
2 a q
1 a q
1 a w
3 a w
$ sort -k3 a | uniq
1 a q
2 a q
1 a w
3 a w
That is, if the stable sort of just -k3 leaves identical lines that are
not adjacent ("1 a q" in my example), then the separate uniq process
won't filter them; while using sort -u with -k1 as the means to force
the entire line as a secondary sort key loses the ability to leave
identical lines separated by a distinct line. Likewise, omitting both
-s and -u lets sort imply a last-resort -k1, at which point uniq sees
the same line order as sort -u sees.
>> i read
>> http://www.gnu.org/s/coreutils/manual/html_node/sort-invocation.html,
>> but got nothing about this.
Actually, it does - under the option -u, I see:
The commands sort -u and sort | uniq are equivalent, but this
equivalence does not extend to arbitrary sort options. For example, sort
-n -u inspects only the value of the initial numeric string when
checking for uniqueness, whereas sort -n | uniq inspects the entire
line. See uniq invocation.
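The manual's sort -n example can be tried directly. The input below is invented for illustration (it is not from the report), and the sketch assumes GNU sort and mktemp.

```shell
#!/bin/sh
# Two lines share the numeric prefix 1 but differ in the rest of the line.
tmp=$(mktemp)
printf '1 apple\n1 banana\n2 cherry\n' > "$tmp"

# -n -u: uniqueness is judged on the numeric key only, so "1 banana" is dropped.
with_u=$(LC_ALL=C sort -n -u "$tmp")

# -n | uniq: uniq compares entire lines, so all three lines survive.
with_uniq=$(LC_ALL=C sort -n "$tmp" | uniq)

printf '%s\n---\n%s\n' "$with_u" "$with_uniq"
rm -f "$tmp"
```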
--
Eric Blake eblake <at> redhat.com +1-801-349-2682
Libvirt virtualization library http://libvirt.org
Information forwarded to bug-coreutils <at> gnu.org: bug#9995; Package coreutils. (Wed, 09 Nov 2011 14:04:02 GMT)
Message #18 received at 9995 <at> debbugs.gnu.org (full text, mbox):
Thanks for your reply.
If I want to use the entire line as a key and sort by the third
field, should I use sort -u -k3 -k1 -k2 a to do that?
--
contact me:
MSN: walkerxk <at> gmail.com
GTALK: walkerxk <at> gmail.com
Information forwarded to bug-coreutils <at> gnu.org: bug#9995; Package coreutils. (Wed, 09 Nov 2011 14:59:01 GMT)
Message #21 received at 9995 <at> debbugs.gnu.org (full text, mbox):
[Let's keep the list in the loop]
On 11/08/2011 07:58 PM, 夏凯 wrote:
> Thanks for your reply.
> If I want to get my result, should I use sort -u -k3 -k1 -k2 a
> to do that?
>
I'm still not quite sure what result you want.
sort -u -k3 -k1 -k2 a
says to sort with three keys - from field 3 to the end of the line, from
field 1 to the end of the line (aka the entire line), and from field 2
to the end of the line (that -k2 is useless, since sorting by field 1 to
the end of the line already sorted everything so that there is no longer
any distinguishing factors from field 2 to the end of the line). Then,
after sorting, sort discards any lines where all three keys are
identical, and since the -k1 key was the entire line, you are discarding
only duplicate lines. But I don't know if that is what you wanted.
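A quick way to confirm that the trailing -k2 adds nothing is to compare the output with and without it. This sketch reuses the five-line sample file from earlier in the thread and assumes GNU sort and mktemp; the variable names are illustrative.

```shell
#!/bin/sh
# Five-line sample file used earlier in this thread.
tmp=$(mktemp)
printf '1 a q\n2 a q\n1 a q\n1 a w\n3 a w\n' > "$tmp"

# Three keys: field 3 onwards, then the whole line, then field 2 onwards.
three_keys=$(LC_ALL=C sort -u -k3 -k1 -k2 "$tmp")

# Two keys: -k1 already covers the whole line, so -k2 cannot break any tie.
two_keys=$(LC_ALL=C sort -u -k3 -k1 "$tmp")

if [ "$three_keys" = "$two_keys" ]; then
    echo "identical output; -k2 is redundant for this input"
fi
printf '%s\n' "$two_keys"
rm -f "$tmp"
```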
--
Eric Blake eblake <at> redhat.com +1-801-349-2682
Libvirt virtualization library http://libvirt.org
Information forwarded to bug-coreutils <at> gnu.org: bug#9995; Package coreutils. (Thu, 10 Nov 2011 03:10:02 GMT)
Message #24 received at 9995 <at> debbugs.gnu.org (full text, mbox):
[top-posting on technical lists is generally frowned on]
[re-adding the list - it's always wiser to keep the list in the loop]
On 11/09/2011 07:25 PM, 夏凯 wrote:
> Actually, I just want the result of sort -sk3 a | uniq. Can't we just
> use -u instead of uniq?
Nope, and I already explained why and gave a sample file to demonstrate
it. These two are equivalent:
sort -k3 a | uniq
sort -u -k3 -k1 a
but there is no way to get both stable sorting that leaves fields 1 and
2 unsorted and in the original order, as well as stripping adjacent
duplicate lines, without also involving a separate uniq process. That
is, there is no one-process counterpart to:
sort -s -k3 a | uniq
The reason is that the only way to match uniq behavior is to have the
sort key cover the entire line, but the moment you add -k1 to cover the
entire line, your sort is no longer stable on your original sort of just
-k3.
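The claim can be checked mechanically by running all three pipelines on the sample file and comparing the results. This is a sketch assuming GNU sort, uniq and mktemp; the variable names are made up.

```shell
#!/bin/sh
# Sample file with a duplicate line ("1 a q") that is not adjacent in the input.
tmp=$(mktemp)
printf '1 a q\n2 a q\n1 a q\n1 a w\n3 a w\n' > "$tmp"

default_uniq=$(LC_ALL=C sort -k3 "$tmp" | uniq)    # last-resort whole-line comparison
two_key_u=$(LC_ALL=C sort -u -k3 -k1 "$tmp")       # explicit whole-line secondary key
stable_uniq=$(LC_ALL=C sort -s -k3 "$tmp" | uniq)  # stable: input order kept within ties

[ "$default_uniq" = "$two_key_u" ] && echo "sort -k3 | uniq matches sort -u -k3 -k1"
[ "$stable_uniq" != "$two_key_u" ] && echo "sort -s -k3 | uniq differs"

# The stable variant still contains the non-adjacent duplicate.
stable_lines=$(printf '%s\n' "$stable_uniq" | wc -l | tr -d ' ')
echo "lines after sort -s -k3 | uniq: $stable_lines"
rm -f "$tmp"
```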
Also, you may want to consider whether -k3 is what you really meant, or
if you want to use -k3,3 (that is, whether sorting by the entire line
except for the first two fields, or sorting by just the third field
while ignoring any fourth or later field). Note that I intentionally
used -k1 as shorthand for the entire line.
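The difference between -k3 and -k3,3 only shows up once a fourth field exists. The two-line input below is invented for illustration; the sketch assumes GNU sort and mktemp.

```shell
#!/bin/sh
# Both lines have the same third field but different fourth fields.
tmp=$(mktemp)
printf '1 a q z\n2 a q b\n' > "$tmp"

# -k3: key runs from field 3 to end of line (" q z" vs " q b"), so both lines are kept.
open_ended=$(LC_ALL=C sort -u -k3 "$tmp")

# -k3,3: key is field 3 only (" q" for both), so -u keeps a single line.
field_only=$(LC_ALL=C sort -u -k3,3 "$tmp")

printf '%s\n---\n%s\n' "$open_ended" "$field_only"
rm -f "$tmp"
```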
--
Eric Blake eblake <at> redhat.com +1-801-349-2682
Libvirt virtualization library http://libvirt.org
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 08 Dec 2011 12:24:03 GMT)