GNU bug report logs -
#14555
Facing Some problem in uniq command
Reported by: Shahid Hussain <shnx88 <at> gmail.com>
Date: Tue, 4 Jun 2013 16:21:02 UTC
Severity: normal
Tags: moreinfo
Done: Assaf Gordon <assafgordon <at> gmail.com>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 14555 in the body.
You can then email your comments to 14555 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-coreutils <at> gnu.org: bug#14555; Package coreutils. (Tue, 04 Jun 2013 16:21:02 GMT)
Acknowledgement sent to Shahid Hussain <shnx88 <at> gmail.com>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Tue, 04 Jun 2013 16:21:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
I have a file (named 'a') which contains the following data.
"";
8003
8004
8005
8010
9040
9041
9042
8336
8336
8337
8338
8338
8339
8340
8341
9000
9216
9217
9218
9219
9220
9221
9232
9233
9234
9248
9249
9250
9251
9264
9265
9280
9296
9281
9297
9001
9226
9040
9040
15008
9041
9042
15009
15010
6169
6170
18000
18000
*************************************************
Below are the commands I am executing, along with their output and my
comments.
[ussc <at> lab211 config]$ uniq -d a
8336
8338
// Displays one line per duplicated value, but there are many more
duplicate entries in the file
[ussc <at> lab211 config]$ uniq -D a
8336
8336
8338
8338
// Displays only two duplicated values, but there are many more duplicate
entries in the file
[ussc <at> lab211 config]$ uniq -c a
1 "";
1 8003
1 8004
1 8005
1 8010
1 9040
1 9041
1 9042
2 8336
1 8337
2 8338
1 8339
1 8340
1 8341
1 9000
1 9216
1 9217
1 9218
1 9219
1 9220
1 9221
1 9232
1 9233
1 9234
1 9248
1 9249
1 9250
1 9251
1 9264
1 9265
1 9280
1 9296
1 9281
1 9297
1 9001
1 9226
1 9040
1 9040
1 15008
1 9041
1 9042
1 15009
1 15010
1 6169
1 6170
1 18000
1 18000
// Note that the last line repeats the previous line (some other entries
repeat as well), but uniq is not able to find it.
Please check this and let me know if I am wrong.
Thanks and Regards,
Shahid Hussain
Bangalore.
Information forwarded to bug-coreutils <at> gnu.org: bug#14555; Package coreutils. (Tue, 04 Jun 2013 16:46:01 GMT)
Message #8 received at 14555 <at> debbugs.gnu.org (full text, mbox):
On 06/04/2013 01:07 PM, Shahid Hussain wrote:
> I have a file (named 'a')which contains following data.
> 9040
> 9041
> 9042
...
> 9040
> 9040
> 15008
> 9041
...
> 1 18000
> 1 18000
> //Observe last line which is repeated with its previous line (some other
> entries are also there)but uniq command not able to find it.
Note 9041 is also repeated but you won't see that
until you sort first, though that's not your specific issue here.
Perhaps you have mixed \n and \r\n line endings or something?
This might be informative?
tail -n2 a | od -Ax -tx1z -v
thanks,
Pádraig.
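Pádraig's mixed line-ending hypothesis is easy to demonstrate; a minimal sketch (the file `demo.txt` is a made-up stand-in for the reporter's data):

```shell
# Two visually identical lines, but the first ends in CRLF and the
# second in plain LF, so uniq treats them as different.
printf '18000\r\n18000\n' > demo.txt
uniq -d demo.txt                   # prints nothing
# Stripping the carriage returns makes the duplicate visible.
tr -d '\r' < demo.txt | uniq -d    # prints: 18000
```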
Information forwarded to bug-coreutils <at> gnu.org: bug#14555; Package coreutils. (Tue, 04 Jun 2013 17:27:01 GMT)
Message #11 received at 14555 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
tag 14555 moreinfo
thanks
On 06/04/2013 06:07 AM, Shahid Hussain wrote:
> I have a file (named 'a')which contains following data.
> 9041
> 9042
> 8336
...
> 9041
Ouch. Your file is not sorted. Therefore, 9041 is NOT unique when run
through 'uniq', which only compares adjacent lines.
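Because uniq only compares adjacent lines, non-adjacent duplicates such as 9041 surface only after sorting; a minimal sketch of the difference (using a made-up three-line file):

```shell
# Non-adjacent duplicates are invisible to uniq on its own.
printf '9041\n9042\n9041\n' > demo.txt
uniq -d demo.txt             # prints nothing
sort demo.txt | uniq -d      # prints: 9041
```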
> And Below is the commands i am executing along with its output with
> comments.
> [ussc <at> lab211 config]$ uniq -d a
> 8336
> 8338
I get different results when copying and pasting from your email:
$ uniq -d a
8336
8338
9040
18000
$ uniq --version | head -n1
uniq (GNU coreutils) 8.17
Could it be you are using an older version of coreutils, and we have
fixed a bug in the meantime for how uniq behaves when presented an
unsorted file?
> 1 18000
> 1 18000
> //Observe last line which is repeated with its previous line (some other
> entries are also there)but uniq command not able to find it.
One other possibility: Are you sure the whitespace is identical on every
line? Or could you have trailing whitespace on one line but not the
other (such as a carriage return), so that the lines really are not
unique even though they appeared unique? If so, that would explain why
_my_ uniq run counted 18000 as a duplicate, if the act of sending the
email and then me copying and pasting into a file munged the whitespace
differences away.
While I suspect that there is no bug in coreutils, I need more
information from you to confirm that claim, so I'm leaving the bug open
for now.
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
[signature.asc (application/pgp-signature, attachment)]
Added tag(s) moreinfo. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Tue, 04 Jun 2013 17:27:02 GMT)
Information forwarded to bug-coreutils <at> gnu.org: bug#14555; Package coreutils. (Wed, 05 Jun 2013 05:20:02 GMT)
Message #16 received at 14555 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Hi,
Appreciate your quick reply. What I am actually doing: my product has many
files containing data in "name = value" format. Using a pattern I extract
only the "value" field from all the files and redirect the output to one
temporary file, because I do not want any value to be repeated in any file.
Then I apply the uniq command to this temporary file (piping through sort:
[sort |uniq -c tempFile]), but I am unable to get the expected result.
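As an aside, a pipeline written as `sort | uniq -c tempFile` will not behave as intended: when uniq is given a file operand it reads that file directly and ignores sort's output on stdin entirely. A minimal sketch of the difference (tempFile here is a made-up example file):

```shell
printf '9041\n9042\n9041\n' > tempFile
# Wrong: uniq opens tempFile and never reads the sorted pipe,
# so the non-adjacent duplicate is never counted as 2.
sort tempFile | uniq -c tempFile
# Right: sort the file and let uniq read from the pipe.
sort tempFile | uniq -c
```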
But as you said, the whitespace must also be identical on every line, so
that might be the problem in my case: when I displayed the file's contents
using cat, manually copied the same data to another file, and then tried
uniq with sort, it worked fine.
So it is fine for me, but it would be even better if the uniq command had
an option to work correctly even when the whitespace is not identical :).
Lot of thanks,
shahid hussain
On Tue, Jun 4, 2013 at 10:00 PM, Eric Blake <eblake <at> redhat.com> wrote:
> Ouch. Your file is not sorted. Therefore, 9041 is NOT unique when run
> through 'uniq', which only compares adjacent lines.
...
> One other possibility: Are you sure the whitespace is identical on every
> line? Or could you have trailing whitespace on one line but not the
> other (such as a carriage return), so that the lines really are not
> unique even though they appeared unique?
...
> While I suspect that there is no bug in coreutils, I need more
> information from you to confirm that claim, so I'm leaving the bug open
> for now.
>
> --
> Eric Blake eblake redhat com +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
Information forwarded to bug-coreutils <at> gnu.org: bug#14555; Package coreutils. (Wed, 05 Jun 2013 15:09:01 GMT)
Message #19 received at 14555 <at> debbugs.gnu.org (full text, mbox):
Shahid Hussain wrote:
> Appreciate your quick reply. What exactly i m doing is there are so many
> files in my product which contains some data in "name = value" format. By
> using some pattern i m extracting only "value" field from all files and
> redirecting the output to one temporarily file as i do not want any value
> to be repeated in any file. And here i m applying uniq command to this
> temporary file (by pipe lining sort [sort |uniq -c tempFile]) But i am
> unable to get expected result.
It might be better if in your script you set:
#!/bin/sh
LC_ALL=C
export LC_ALL
...
sort | uniq
...
That will force a standard sort order everywhere in your script.
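The effect of forcing the C locale on collation is easy to see with a small example (input supplied inline):

```shell
# In the C locale, sort orders by byte value, so all uppercase
# letters come before all lowercase letters.
printf 'b\nA\na\nB\n' | LC_ALL=C sort
# prints A, B, a, b (one per line)
```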
> But as you have told whitespace also should be identical at every line so
> this might be the problem in my case. Because when i displayed content of
> file using cat command and manually copied the same data to another file
> and then tried uniq with sort command it works fine.
Without knowing enough about your data a quick and dirty hack to clean
up whitespace might be to pass it through awk.
awk '{print$1}' somefile1 | sort | uniq ...
Since awk splits on whitespace, this will print only the first field;
any surrounding whitespace, or anything after the first field, will be
discarded.
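A runnable sketch of that cleanup hack (input supplied inline rather than from a file):

```shell
# Trailing whitespace makes the two 18000 lines differ; awk's field
# splitting discards it, letting uniq see the duplicate.
printf '18000 \n18000\n' | uniq -d                     # prints nothing
printf '18000 \n18000\n' | awk '{print $1}' | uniq -d  # prints: 18000
```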
> So it is fine for me but it would be too better if there could be an option
> in uniq command to work fine even if whitespace is not identical :).
No. The way is not to use an option. The way is to prepare the data
without whitespace differences. You have the option of using tools
like awk to split on whitespace while preparing the data. Preparing
the data to avoid whitespace differences is the right option to use.
Bob
Information forwarded to bug-coreutils <at> gnu.org: bug#14555; Package coreutils. (Tue, 23 Oct 2018 22:42:02 GMT)
Message #22 received at 14555 <at> debbugs.gnu.org (full text, mbox):
close 14555
stop
(triaging old bugs)
On 05/06/13 09:06 AM, Bob Proulx wrote:
> Shahid Hussain wrote:
>> So it is fine for me but it would be too better if there could be an option
>> in uniq command to work fine even if whitespace is not identical :).
>
> No. The way is not to use an option. The way is to prepare the data
> without whitespace differences. You have the option of using tools
> like awk to split on whitespace while preparing the data. Preparing
> the data to avoid whitespace differences is the right option to use.
>
With no further comments in 5 years, I'm closing this bug.
-assaf
bug closed, send any further explanations to 14555 <at> debbugs.gnu.org and Shahid Hussain <shnx88 <at> gmail.com>. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Tue, 30 Oct 2018 04:30:02 GMT)
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 27 Nov 2018 12:24:09 GMT)
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.