GNU bug report logs - #11220
uniq -d and -Du bug?

Reported by: phil colbourn <philcolbourn <at> gmail.com>

Date: Wed, 11 Apr 2012 06:25:01 UTC

Severity: normal

Tags: notabug

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.

Report forwarded to bug-coreutils <at> gnu.org:
bug#11220; Package coreutils. (Wed, 11 Apr 2012 06:25:01 GMT)

Acknowledgement sent to phil colbourn <philcolbourn <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 11 Apr 2012 06:25:01 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: phil colbourn <philcolbourn <at> gmail.com>
To: bug-coreutils <at> gnu.org
Subject: uniq -d and -Du bug?
Date: Wed, 11 Apr 2012 15:43:02 +1000

What should this print?

echo -e 'aa\naa\naa\n' | uniq -d

To me this says:

1. uniqueness is defined by whole line so there is 1 unique value 'aa';
2. -d option says to 'only print duplicate lines';
3. 1st 'aa' is (so far) unique so it should NOT be printed;
4. 2nd 'aa' is not unique so it SHOULD be printed; and
5. 3rd 'aa' is not unique so it SHOULD also be printed.

I think I should get this:

aa
aa

But I get this:

aa

To see what duplicated line is printed I tried this:

echo -e 'a1\na2\na3\na4\n' | uniq -d -w 1
a1

So the first line is printed. This is not what I expected at all.



Now, -D means 'print all duplicate lines' and

echo -e 'aa\naa\naa\n' | uniq -D

prints what I expect it to:

aa
aa
aa

Now, -D and -u together mean 'print all duplicate lines' and 'only print unique
lines'.

I think this should print all lines, since the union of all unique lines and
all duplicate lines is all lines.

But,

echo -e 'aa\naa\naa\n' | uniq -Du

prints this:

aa
aa

To see what lines are being printed I tried this:

echo -e 'a1\na2\na3\na4\n' | uniq -Du -w 1
a1
a2
a3

Therefore -Du prints the first N-1 matching lines and not the last matching line.

(Which is sort of what I expect -d to print.)

Are these bugs?

Information forwarded to bug-coreutils <at> gnu.org:
bug#11220; Package coreutils. (Wed, 11 Apr 2012 10:09:01 GMT)

Message #8 received at 11220 <at> debbugs.gnu.org:

From: phil colbourn <philcolbourn <at> gmail.com>
To: 11220 <at> debbugs.gnu.org
Subject: Sorry, forgot version.
Date: Wed, 11 Apr 2012 20:07:23 +1000

uniq version 8.13

Information forwarded to bug-coreutils <at> gnu.org:
bug#11220; Package coreutils. (Wed, 11 Apr 2012 12:01:01 GMT)

Message #11 received at 11220 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: phil colbourn <philcolbourn <at> gmail.com>
Cc: 11220 <at> debbugs.gnu.org
Subject: Re: bug#11220: uniq -d and -Du bug?
Date: Wed, 11 Apr 2012 13:59:31 +0200

phil colbourn wrote:

> What should this print?
>
> echo -e 'aa\naa\naa\n' | uniq -d

It's better to avoid echo -e.  Use printf instead:

    printf 'aa\naa\naa\n' | uniq -d

> To me this says:
>
> 1. uniqueness is defined by whole line so there is 1 unique value 'aa';
> 2. -d option says to 'only print duplicate lines';

When in doubt, follow the advice at the bottom of the man page
and read the "real" (texinfo) documentation:

       The full documentation for uniq is maintained as a Texinfo manual.   If
       the  info  and  uniq  programs are properly installed at your site, the
       command

              info coreutils 'uniq invocation'

       should give you access to the complete manual.

> 3. 1st 'aa' is (so far) unique so it should NOT be printed;
> 4. 2nd 'aa' is not unique so it SHOULD be printed; and
> 5. 3rd 'aa' is not unique so it SHOULD also be printed.
>
> I think I should get this:
>
> aa
> aa
>
> But I get this:
>
> aa

Thanks for the report.
The problem is that the description in the man page is too succinct,
perhaps because -d means different things, depending on what other
options you use it with.

How is what you see inconsistent with the documentation?
(info coreutils uniq)

`-d'
`--repeated'
     Discard lines that are not repeated.  When used by itself, this
     option causes `uniq' to print the first copy of each repeated line,
     and nothing else.
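
A minimal sketch of what that wording implies, assuming GNU uniq (the -w
option used here is not part of POSIX). The helper name first_copies and its
input are made up for illustration: two repeated groups (a*, c*) and one
unique line (b1), compared on the first character only.

```shell
# Hypothetical helper: feed two repeated groups and one unique line to
# uniq -d, comparing only the first character of each line (-w 1).
first_copies() {
  printf '%s\n' a1 a2 b1 c1 c2 c3 | uniq -d -w 1
}

# Prints a1 and c1: the first copy of each repeated group, and nothing
# for the unique line b1.
first_copies
```

Using distinct suffixes (a1, a2, ...) is only a way to see which copy of a
group survives; with identical lines the copies cannot be told apart.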

Added tag(s) notabug. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Wed, 11 Apr 2012 12:09:01 GMT)

Reply sent to Eric Blake <eblake <at> redhat.com>:
You have taken responsibility. (Wed, 11 Apr 2012 12:09:02 GMT)

Notification sent to phil colbourn <philcolbourn <at> gmail.com>:
bug acknowledged by developer. (Wed, 11 Apr 2012 12:09:02 GMT)

Message #18 received at 11220-done <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: phil colbourn <philcolbourn <at> gmail.com>
Cc: 11220-done <at> debbugs.gnu.org
Subject: Re: bug#11220: uniq -d and -Du bug?
Date: Wed, 11 Apr 2012 06:07:30 -0600

tag 11220 notabug
thanks

On 04/10/2012 11:43 PM, phil colbourn wrote:
> What should this print?
> 
> echo -e 'aa\naa\naa\n' | uniq -d

Thanks for the report.  POSIX requires this to print only a single
instance of 'aa', whether or not -d is in effect; coreutils does this by
outputting the last line in a series of duplicates.  The point of -d is
to suppress the single-line outputs that do not have a corresponding
duplicate input, not to output all instances of a duplicated line.

By the way, 'echo -e' is not portable; POSIX recommends you use printf
instead.
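
One portable way to build this kind of test input is to let printf repeat
its format string for each argument. A small sketch; the helper name lines
is made up for illustration and is not part of coreutils.

```shell
# Hypothetical helper: print each argument on its own line.
# printf reuses the format string until all arguments are consumed.
lines() {
  printf '%s\n' "$@"
}

# Same input as the report, without relying on echo -e.
lines aa aa aa | uniq -d
```

This also avoids the extra empty line that the trailing \n in the original
echo command appends to the input.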

> 
> Now, -D and -u together mean 'print all duplicate lines' and 'only print unique
> lines'.

-D is not specified by POSIX.  However, -u is defined by POSIX to
suppress output lines that have a corresponding duplicate input.

> 
> I think this should print all lines, since the union of all unique lines and
> all duplicate lines is all lines.
> 

> 
> Therefore -Du prints the first N-1 matching lines and not the last matching line.

In isolation, uniq prints the last instance of the duplicated line, and
uniq -u suppresses the output of the 4th line.  In isolation, -D says to
output the first three lines which are normally omitted because they
have duplicates, in addition to the 4th line that is printed by default.
So in combination, -Du says to print the lines with subsequent
duplicates (the first three lines) but to suppress the output line that
corresponds to the last input line that ends a sequence of duplicates
(the 4th line).

Perhaps we can document this behavior better.  Or perhaps we can change
the behavior of -D (but at risk of breaking existing clients that depend
on the current behavior).  But we can't change -u or -d behavior.

Put another way, per POSIX, the default behavior is subtractive (remove
any line with a subsequent duplicate), -d is subtractive (remove any
line with no duplicate), and -u is subtractive (remove any last line
that had a prior duplicate), and GNU -D is additive (print any line with
a subsequent duplicate, to counter the initial default).
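
The modes discussed here can be compared side by side. This is a sketch
assuming GNU uniq (-D and -w are not in POSIX); the sample function and its
input are invented for illustration. The comments describe what GNU uniq is
observed to print for this input, not anything POSIX guarantees for -D.

```shell
# Hypothetical sample input: one repeated group (a1..a4, equal in the
# first character) between two unique lines.
sample() {
  printf '%s\n' 1 a1 a2 a3 a4 2
}

sample | uniq -w 1         # one line per group: 1, a1, 2
sample | uniq -d -w 1      # repeated groups only, one line each: a1
sample | uniq -u -w 1      # groups without repeats only: 1, 2
sample | uniq -D -w 1      # every line of each repeated group: a1..a4
sample | uniq -D -u -w 1   # repeated group minus its last line: a1..a3
```

Running the five commands on the same input makes it easier to see which
lines each option removes or adds relative to the default.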

> 
> Are these bugs?

At this point, I will claim that the behavior is intended, and therefore
close out the bug.  But if you are willing to submit documentation
patches, or even code patches accompanied by extensive test cases to
demonstrate the corner cases of any new behavior, feel free to continue
to reply to this bug report.

-- 
Eric Blake   eblake <at> redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Message #19 received at 11220-done <at> debbugs.gnu.org:

From: phil colbourn <philcolbourn <at> gmail.com>
To: 11220-done <at> debbugs.gnu.org
Subject: Re: bug#11220: uniq -d and -Du bug?
Date: Thu, 12 Apr 2012 22:47:39 +1000

Thanks Jim and Eric for your replies.

Jim, perhaps info's version of option -d should be used in uniq's man page?

> In isolation, uniq prints the last instance of the duplicated line, and

I think it prints the first line, not the last:

printf "a1\na2\na3\na4" | uniq -w 1
a1


> uniq -u suppresses the output of the 4th line.


I have lost you here. -u suppresses any lines with duplicates.

printf "a1\na2\na3\na4" | uniq -u -w 1
(no output)

I suspect you mean -Du?

printf "a1\na2\na3\na4" | uniq -Du -w 1
a1
a2
a3


>  In isolation, -D says to
> output the first three lines which are normally omitted because they
> have duplicates, in addition to the 4th line that is printed by default.
>

But,

printf "a1\na2\na3\na4" | uniq -D -w 1
a1
a2
a3
a4

and default is this:

printf "a1\na2\na3\na4" | uniq -w 1
a1

So, if I understand your logic correctly, and if I correct it by referring
to the 1st and not the 4th duplicate, then -Du should give me:

a2
a3
a4


>  So in combination, -Du says to print the lines with subsequent
> duplicates (the first three lines) but to suppress the output line that
> corresponds to the last input line that ends a sequence of duplicates
> (the 4th line).
>
>



> Perhaps we can document this behavior better.  Or perhaps we can change
> the behavior of -D (but at risk of breaking existing clients that depend
> on the current behavior).  But we can't change -u or -d behavior.
>
>
I think changing behaviour of a utility is dangerous - I thought it was a
bug, but both you and Jim have indicated that it is poor documentation.



> Put another way, per POSIX, the default behavior is subtractive (remove
> any line with a subsequent duplicate), -d is subtractive (remove any
> line with no duplicate), and -u is subtractive (remove any last line
> that had a prior duplicate), and GNU -D is additive (print any line with
> a subsequent duplicate, to counter the initial default).
>
>
Whilst not exactly following your previous notes, I don't think this
explains uniq's behaviour.

> At this point, I will claim that the behavior is intended, and therefore
> close out the bug.  But if you are willing to submit documentation
> patches, or even code patches accompanied by extensive test cases to
> demonstrate the corner cases of any new behavior, feel free to continue
> to reply to this bug report.
>
>
Having now read info's uniq pages, I think that

1) -Du is undefined behaviour
2) current output makes no sense.

e.g.

printf "1\na1\na2\na3\na4\n2" | uniq -w 1
1
a1
2
(complies)

printf "1\na1\na2\na3\na4\n2" | uniq -u -w 1
1
2
(complies)

printf "1\na1\na2\na3\na4\n2" | uniq -D -w 1
a1
a2
a3
a4
(complies)

printf "1\na1\na2\na3\na4\n2" | uniq -Du -w 1
a1
a2
a3

I think this one makes no sense.

But... this behaviour IS exactly what I need so if this can be documented
then I would be happy - others might not be.
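
If the goal is that output (every line of a repeated group except the last)
without relying on the -Du combination, the same selection can be written
out explicitly. A sketch using awk, assuming the grouping key is the first
character as with -w 1; the function name all_but_last is made up for
illustration.

```shell
# Print each line that is immediately followed by a line with the same
# first character, i.e. every member of a repeated group except its last.
all_but_last() {
  awk '{
    key = substr($0, 1, 1)
    # The previous line has a later duplicate, so it is safe to emit it now.
    if (NR > 1 && key == prevkey) print prev
    prev = $0
    prevkey = key
  }'
}

# Prints a1, a2 and a3 for the mixed input used above.
printf '%s\n' 1 a1 a2 a3 a4 2 | all_but_last
```

Spelling the rule out this way keeps a script working even if the behaviour
of -D combined with -u were ever changed.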


On a side point, why can't the official info pages be automagically
converted into man pages to avoid discrepancies?

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 11 May 2012 11:24:03 GMT)

This bug report was last modified 13 years and 135 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
