GNU bug report logs - #11220: uniq -d and -Du bug?
Report forwarded to bug-coreutils <at> gnu.org: bug#11220; Package coreutils. (Wed, 11 Apr 2012 06:25:01 GMT)
Acknowledgement sent to phil colbourn <philcolbourn <at> gmail.com>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 11 Apr 2012 06:25:01 GMT)
Message #5 received at submit <at> debbugs.gnu.org:
What should this print?
echo -e 'aa\naa\naa\n' | uniq -d
To me this says:
1. uniqueness is defined by whole line so there is 1 unique value 'aa';
2. -d option says to 'only print duplicate lines';
3. 1st 'aa' is (so far) unique so it should NOT be printed;
4. 2nd 'aa' is not unique so it SHOULD be printed; and
5. 3rd 'aa' is not unique so it SHOULD also be printed.
I think I should get this:
aa
aa
But I get this:
aa
To see what duplicated line is printed I tried this:
echo -e 'a1\na2\na3\na4\n' | uniq -d -w 1
a1
So, first line is printed. This is not what I expected at all.
Now, -D means 'print all duplicate lines' and
echo -e 'aa\naa\naa\n' | uniq -D
prints what I expect it to:
aa
aa
aa
Now, -D and -u mean 'print all duplicate lines' and 'only print unique
lines'.
I think this should print all lines since union of all unique lines and all
duplicate lines is all lines.
But,
echo -e 'aa\naa\naa\n' | uniq -Du
prints this:
aa
aa
To see what lines are being printed I tried this:
echo -e 'a1\na2\na3\na4\n' | uniq -Du -w 1
a1
a2
a3
Therefore -Du prints first N-1 matching lines and not last matching line.
(Which is sort-of like what I expect -d to print)
Are these bugs?
Information forwarded to bug-coreutils <at> gnu.org: bug#11220; Package coreutils. (Wed, 11 Apr 2012 10:09:01 GMT)
Message #8 received at 11220 <at> debbugs.gnu.org:
uniq version 8.13
Information forwarded to bug-coreutils <at> gnu.org: bug#11220; Package coreutils. (Wed, 11 Apr 2012 12:01:01 GMT)
Message #11 received at 11220 <at> debbugs.gnu.org:
phil colbourn wrote:
> What should this print?
>
> echo -e 'aa\naa\naa\n' | uniq -d
It's better to avoid echo -e. Use printf instead:
printf 'aa\naa\naa\n' | uniq -d
> To me this says:
>
> 1. uniqueness is defined by whole line so there is 1 unique value 'aa';
> 2. -d option says to 'only print duplicate lines';
When in doubt, follow the advice at the bottom of the man page
and read the "real" (texinfo) documentation:
The full documentation for uniq is maintained as a Texinfo manual. If
the info and uniq programs are properly installed at your site, the
command
info coreutils 'uniq invocation'
should give you access to the complete manual.
> 3. 1st 'aa' is (so far) unique so it should NOT be printed;
> 4. 2nd 'aa' is not unique so it SHOULD be printed; and
> 5. 3rd 'aa' is not unique so it SHOULD also be printed.
>
> I think I should get this:
>
> aa
> aa
>
> But I get this:
>
> aa
Thanks for the report.
The problem is that the description in the man page is too succinct,
perhaps because -d means different things, depending on what other
options you use it with.
How is what you see inconsistent with the documentation?
(info coreutils uniq)
`-d'
`--repeated'
Discard lines that are not repeated. When used by itself, this
option causes `uniq' to print the first copy of each repeated line,
and nothing else.
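The documented `-d` behavior quoted above can be checked with a short sketch. It assumes GNU uniq; the input and the `sample` helper are made up for illustration and do not come from the report.

```shell
# Hypothetical sample input: 'aa' appears three times in a row,
# 'bb' once, and 'cc' twice in a row.
sample() { printf 'aa\naa\naa\nbb\ncc\ncc\n'; }

# -d keeps one copy of each repeated group and discards 'bb',
# which is not repeated.
repeated=$(sample | uniq -d)
printf '%s\n' "$repeated"
# prints:
# aa
# cc
```

Using printf rather than `echo -e` also avoids the extra empty line that the trailing `\n` in the original echo command adds to the input.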
Added tag(s) notabug. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Wed, 11 Apr 2012 12:09:01 GMT)
Reply sent to Eric Blake <eblake <at> redhat.com>: You have taken responsibility. (Wed, 11 Apr 2012 12:09:02 GMT)
Notification sent to phil colbourn <philcolbourn <at> gmail.com>: bug acknowledged by developer. (Wed, 11 Apr 2012 12:09:02 GMT)
Message #18 received at 11220-done <at> debbugs.gnu.org:
tag 11220 notabug
thanks
On 04/10/2012 11:43 PM, phil colbourn wrote:
> What should this print?
>
> echo -e 'aa\naa\naa\n' | uniq -d
Thanks for the report. POSIX requires this to print only a single
instance of 'aa', whether or not -d is in effect; coreutils does this by
outputting the last line in a series of duplicates. The point of -d is
to suppress the single-line outputs that do not have a corresponding
duplicate input, not to output all instances of a duplicated line.
By the way, 'echo -e' is not portable; POSIX recommends you use printf
instead.
>
> Now, -D and -u mean 'print all duplicate lines' and 'only print unique
> lines'.
-D is not specified by POSIX. However, -u is defined by POSIX to
suppress output lines that have a corresponding duplicate input.
>
> I think this should print all lines since union of all unique lines and all
> duplicate lines is all lines.
>
>
> Therefore -Du prints first N-1 matching lines and not last matching line.
In isolation, uniq prints the last instance of the duplicated line, and
uniq -u suppresses the output of the 4th line. In isolation, -D says to
output the first three lines which are normally omitted because they
have duplicates, in addition to the 4th line that is printed by default.
So in combination, -Du says to print the lines with subsequent
duplicates (the first three lines) but to suppress the output line that
corresponds to the last input line that ends a sequence of duplicates
(the 4th line).
Perhaps we can document this behavior better. Or perhaps we can change
the behavior of -D (but at risk of breaking existing clients that depend
on the current behavior). But we can't change -u or -d behavior.
Put another way, per POSIX, the default behavior is subtractive (remove
any line with a subsequent duplicate), -d is subtractive (remove any
line with no duplicate), and -u is subtractive (remove any last line
that had a prior duplicate), and GNU -D is additive (print any line with
a subsequent duplicate, to counter the initial default).
>
> Are these bugs?
At this point, I will claim that the behavior is intended, and therefore
close out the bug. But if you are willing to submit documentation
patches, or even code patches accompanied by extensive test cases to
demonstrate the corner cases of any new behavior, feel free to continue
to reply to this bug report.
--
Eric Blake eblake <at> redhat.com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
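The per-group rules discussed in this thread can be written down as a rough model. This is only a sketch: `model_uniq` and `sample` are hypothetical helpers, not part of coreutils, and the awk logic is not how uniq is implemented. It groups adjacent lines by their first character (as `-w 1` does) and follows the outputs reported in this thread, in which the first line of a group is the one that gets printed.

```shell
# model_uniq MODE: hypothetical helper that imitates the observed output of
# GNU uniq -w 1 for MODE in: default, d, u, D, Du.
model_uniq() {
  awk -v mode="$1" '
    # Emit the lines of the finished group according to the selected mode.
    function flush(   i) {
      if (n == 0) return
      if (mode == "default")          print g[1]
      else if (mode == "d" && n > 1)  print g[1]
      else if (mode == "u" && n == 1) print g[1]
      else if (mode == "D" && n > 1)  for (i = 1; i <= n; i++) print g[i]
      else if (mode == "Du" && n > 1) for (i = 1; i < n; i++) print g[i]
      n = 0
    }
    {
      k = substr($0, 1, 1)            # compare first character only, like -w 1
      if (n > 0 && k != key) flush()  # a new key closes the previous group
      key = k
      g[++n] = $0
    }
    END { flush() }
  '
}

# Sample input: a single "1", four "a*" lines, then a single "2".
sample() { printf '1\na1\na2\na3\na4\n2\n'; }

# Compare the model with the installed uniq (assumes GNU uniq for -D and -w).
for mode in d u D Du; do
  if [ "$(sample | model_uniq "$mode")" = "$(sample | uniq "-$mode" -w 1)" ]; then
    echo "-$mode: model matches uniq"
  else
    echo "-$mode: model differs from uniq"
  fi
done
```

In this model, `-Du` on a group of N matching lines prints the first N-1 of them, which is the behavior the report asks about.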
Message #19 received at 11220-done <at> debbugs.gnu.org:
Thanks Jim and Eric for your replies.
Jim, perhaps info's version of option -d should be used in uniq's man page?
> In isolation, uniq prints the last instance of the duplicated line, and
I think it prints the first line, not the last:
printf "a1\na2\na3\na4" | uniq -w 1
a1
> uniq -u suppresses the output of the 4th line.
I have lost you here. -u suppresses any lines with duplicates.
printf "a1\na2\na3\na4" | uniq -u -w 1
(no output)
I suspect you mean -Du?
printf "a1\na2\na3\na4" | uniq -Du -w 1
a1
a2
a3
> In isolation, -D says to
> output the first three lines which are normally omitted because they
> have duplicates, in addition to the 4th line that is printed by default.
>
But,
printf "a1\na2\na3\na4" | uniq -D -w 1
a1
a2
a3
a4
and default is this:
printf "a1\na2\na3\na4" | uniq -w 1
a1
So, if I understand your logic correctly and if I correct your logic by
referring to the 1st and not the 4th duplicate then -Du should give me
a2
a3
a4
> So in combination, -Du says to print the lines with subsequent
> duplicates (the first three lines) but to suppress the output line that
> corresponds to the last input line that ends a sequence of duplicates
> (the 4th line).
>
>
> Perhaps we can document this behavior better. Or perhaps we can change
> the behavior of -D (but at risk of breaking existing clients that depend
> on the current behavior). But we can't change -u or -d behavior.
>
>
I think changing behaviour of a utility is dangerous - I thought it was a
bug, but both you and Jim have indicated that it is poor documentation.
> Put another way, per POSIX, the default behavior is subtractive (remove
> any line with a subsequent duplicate), -d is subtractive (remove any
> line with no duplicate), and -u is subtractive (remove any last line
> that had a prior duplicate), and GNU -D is additive (print any line with
> a subsequent duplicate, to counter the initial default).
>
>
Whilst not exactly following your previous notes, I don't think this
explains uniq's behaviour.
> At this point, I will claim that the behavior is intended, and therefore
> close out the bug. But if you are willing to submit documentation
> patches, or even code patches accompanied by extensive test cases to
> demonstrate the corner cases of any new behavior, feel free to continue
> to reply to this bug report.
>
>
Having now read info's uniq pages, I think that
1) -Du is undefined behaviour
2) current output makes no sense.
eg.
printf "1\na1\na2\na3\na4\n2" | uniq -w 1
1
a1
2
(comply)
printf "1\na1\na2\na3\na4\n2" | uniq -u -w 1
1
2
(comply)
printf "1\na1\na2\na3\na4\n2" | uniq -D -w 1
a1
a2
a3
a4
(comply)
printf "1\na1\na2\na3\na4\n2" | uniq -Du -w 1
a1
a2
a3
I think this one makes no sense.
But... this behaviour IS exactly what I need so if this can be documented
then I would be happy - others might not be.
On a side point, why can't the official info pages be automagically
converted into man pages to avoid discrepancies?
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 11 May 2012 11:24:03 GMT)
This bug report was last modified 13 years and 135 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.