GNU bug report logs - #10287
[wishlist] uniq can remove non adjacent lines

Package: coreutils;

Reported by: Stéphane Blondon <stephane.blondon <at> gmail.com>

Date: Tue, 13 Dec 2011 02:52:01 UTC

Severity: wishlist

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 10287 in the body.
You can then email your comments to 10287 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-coreutils <at> gnu.org:
bug#10287; Package coreutils. (Tue, 13 Dec 2011 02:52:02 GMT)

Acknowledgement sent to Stéphane Blondon <stephane.blondon <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Tue, 13 Dec 2011 02:52:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Stéphane Blondon <stephane.blondon <at> gmail.com>
To: bug-coreutils <at> gnu.org
Subject: [wishlist] uniq can remove non adjacent lines
Date: Mon, 12 Dec 2011 23:54:57 +0100
Tool: uniq
Priority: wishlist

Hello,

I think `uniq` should have an additional option (for example -a,
--all) to remove duplicate lines even when they are not adjacent.

The man page explains a workaround based on `sort`, but it can be
complex to use. A few weeks ago, I had to `uniq`-ize random numbers,
and the sort approach couldn't really work. Fortunately, the order
was not important, so using `sort | uniq | sort --random-sort` was an
acceptable solution. I imagine cases based on other tools like `top`
could be a problem too.

If you are interested, I could try to provide a patch. (I have learnt
C but I don't use it today.)

I don't think the increased memory use is a problem today, so a
warning in the man page should be enough.


Thanks for everything,
-- 
Stéphane




Information forwarded to bug-coreutils <at> gnu.org:
bug#10287; Package coreutils. (Tue, 13 Dec 2011 04:22:01 GMT)

Message #8 received at 10287 <at> debbugs.gnu.org:

From: Bob Proulx <bob <at> proulx.com>
To: Stéphane Blondon <stephane.blondon <at> gmail.com>
Cc: 10287 <at> debbugs.gnu.org
Subject: Re: bug#10287: [wishlist] uniq can remove non adjacent lines
Date: Mon, 12 Dec 2011 21:20:18 -0700

Stéphane Blondon wrote:
> I think `uniq` should have an additional option (for example -a,
> --all) to remove duplicate lines even when they are not adjacent.
> 
> The man page explains a workaround based on `sort`, but it can be
> complex to use. A few weeks ago, I had to `uniq`-ize random numbers,
> and the sort approach couldn't really work. Fortunately, the order
> was not important, so using `sort | uniq | sort --random-sort` was an
> acceptable solution. I imagine cases based on other tools like `top`
> could be a problem too.

If you want to print only the first occurrence of each line, then this
perl one-liner will do it.

  perl -lne 'print $_ if ! defined $a{$_}; $a{$_}=$_;'

Bob




Information forwarded to bug-coreutils <at> gnu.org:
bug#10287; Package coreutils. (Tue, 13 Dec 2011 08:32:02 GMT)

Message #11 received at 10287 <at> debbugs.gnu.org:

From: Davide Brini <dave_br <at> gmx.com>
To: 10287 <at> debbugs.gnu.org
Subject: Re: bug#10287: [wishlist] uniq can remove non adjacent lines
Date: Tue, 13 Dec 2011 09:29:51 +0100

On Mon, 12 Dec 2011 21:20:18 -0700, Bob Proulx <bob <at> proulx.com> wrote:

> Stéphane Blondon wrote:
> > I think `uniq` should have an additional option (for example -a,
> > --all) to remove duplicate lines even when they are not adjacent.
> > 
> > The man page explains a workaround based on `sort`, but it can be
> > complex to use. A few weeks ago, I had to `uniq`-ize random numbers,
> > and the sort approach couldn't really work. Fortunately, the order
> > was not important, so using `sort | uniq | sort --random-sort` was an
> > acceptable solution. I imagine cases based on other tools like `top`
> > could be a problem too.
> 
> If you want to print only the first occurrence of each line, then this
> perl one-liner will do it.
> 
>   perl -lne 'print $_ if ! defined $a{$_}; $a{$_}=$_;'

While we're at it, this is the typical awk way to do that:

awk '!a[$0]++'


-- 
D.
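
For readers unfamiliar with the awk idiom above, here is a minimal sketch of what `!a[$0]++` does when written out longhand. The function name `dedup_keep_order` and the array name `seen` are illustrative only and do not come from the thread.

```shell
# Longhand form of awk '!a[$0]++': print a line only the first time it is seen.
# dedup_keep_order and seen are hypothetical names chosen for this sketch.
dedup_keep_order() {
    awk '{
        if (seen[$0] == 0) {   # an unset array entry compares equal to 0
            print              # first occurrence: emit the line
        }
        seen[$0]++             # count every occurrence of this line
    }' "$@"
}

# Example: the second "b" and the second "a" are dropped.
printf 'b\na\nb\nc\na\n' | dedup_keep_order
```

The example prints `b`, `a`, `c`, one per line. Because every distinct line is kept as an array key, memory use grows with the number of distinct lines.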




Reply sent to Pádraig Brady <P <at> draigBrady.com>:
You have taken responsibility. (Tue, 13 Dec 2011 08:47:02 GMT)

Notification sent to Stéphane Blondon <stephane.blondon <at> gmail.com>:
bug acknowledged by developer. (Tue, 13 Dec 2011 08:47:02 GMT)

Message #16 received at 10287-done <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Stéphane Blondon <stephane.blondon <at> gmail.com>
Cc: 10287-done <at> debbugs.gnu.org
Subject: Re: bug#10287: [wishlist] uniq can remove non adjacent lines
Date: Tue, 13 Dec 2011 08:45:24 +0000

On 12/12/2011 10:54 PM, Stéphane Blondon wrote:
> Tool: uniq
> Priority: wishlist
> 
> Hello,
> 
> I think `uniq` should have an additional option (for example -a,
> --all) to remove duplicate lines even when they are not adjacent.
> 
> The man page explains a workaround based on `sort`, but it can be
> complex to use. A few weeks ago, I had to `uniq`-ize random numbers,
> and the sort approach couldn't really work. Fortunately, the order
> was not important, so using `sort | uniq | sort --random-sort` was an
> acceptable solution. I imagine cases based on other tools like `top`
> could be a problem too.
> 
> If you are interested, I could try to provide a patch. (I have learnt
> C but I don't use it today.)
> 
> I don't think the increased memory use is a problem today, so a
> warning in the man page should be enough.

Well, that would increase the complexity of `uniq` a _lot_:
http://lists.gnu.org/archive/html/coreutils/2011-11/msg00018.html
For that reason I would be against adding such a feature.
Note that improving the field selection of `uniq` is appropriate,
and would make DSU (decorate-sort-undecorate) solutions using `sort`
easier to implement.

cheers,
Pádraig.
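
To illustrate the DSU approach mentioned above, the following is a rough sketch of one way it could be done with the existing `-f` option of `uniq`. It is an assumption about what such a pipeline might look like, not something proposed in the thread, and the function name `dedup_dsu` is hypothetical.

```shell
# Decorate-sort-undecorate (DSU) sketch: drop repeated lines while keeping
# first occurrences in their original order. dedup_dsu is a hypothetical name.
dedup_dsu() {
    tab=$(printf '\t')
    awk '{ print NR "\t" $0 }' "$@" |     # decorate: prefix each line with its number
    LC_ALL=C sort -t "$tab" -k2 -k1,1n |  # group identical lines, earliest first
    LC_ALL=C uniq -f1 |                   # compare everything after the number field
    LC_ALL=C sort -n |                    # restore the original order
    cut -f2-                              # undecorate: strip the line number
}

# Example: the second "b" and the second "a" are dropped.
printf 'b\na\nb\nc\na\n' | dedup_dsu
```

The example prints `b`, `a`, `c`, one per line. Unlike the hash-based one-liners, this variant does not keep every distinct line in memory at once, at the cost of two sorts.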




Information forwarded to bug-coreutils <at> gnu.org:
bug#10287; Package coreutils. (Tue, 13 Dec 2011 08:48:01 GMT)

Message #19 received at 10287 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Bob Proulx <bob <at> proulx.com>
Cc: 10287 <at> debbugs.gnu.org,
	Stéphane Blondon <stephane.blondon <at> gmail.com>
Subject: Re: bug#10287: [wishlist] uniq can remove non adjacent lines
Date: Tue, 13 Dec 2011 09:46:14 +0100

Bob Proulx wrote:

> Stéphane Blondon wrote:
>> I think `uniq` should have an additional option (for example -a,
>> --all) to remove duplicate lines even when they are not adjacent.
>>
>> The man page explains a workaround based on `sort`, but it can be
>> complex to use. A few weeks ago, I had to `uniq`-ize random numbers,
>> and the sort approach couldn't really work. Fortunately, the order
>> was not important, so using `sort | uniq | sort --random-sort` was an
>> acceptable solution. I imagine cases based on other tools like `top`
>> could be a problem too.
>
> If you want to print only the first occurrence of each line, then this
> perl one-liner will do it.
>
>   perl -lne 'print $_ if ! defined $a{$_}; $a{$_}=$_;'

Thanks, but with large files, isn't it better to store not
the full line, but rather a constant?

  perl -lne 'print $_ if ! defined $seen{$_}; $seen{$_}=1'

(actually, using "1" could be seen as misleading, since 0 or even undef
would also work)

I think you can drop the "l".
I have a slight preference for this:

  perl -ne 'defined $seen{$_} or print; $seen{$_}=1'




Information forwarded to bug-coreutils <at> gnu.org:
bug#10287; Package coreutils. (Tue, 13 Dec 2011 18:09:01 GMT)

Message #22 received at 10287 <at> debbugs.gnu.org:

From: Bob Proulx <bob <at> proulx.com>
To: 10287 <at> debbugs.gnu.org,
	Stéphane Blondon <stephane.blondon <at> gmail.com>
Subject: Re: bug#10287: [wishlist] uniq can remove non adjacent lines
Date: Tue, 13 Dec 2011 11:06:50 -0700

Jim Meyering wrote:
> Bob Proulx wrote:
> > If you want to print only the first occurrence of each line, then this
> > perl one-liner will do it.
> >
> >   perl -lne 'print $_ if ! defined $a{$_}; $a{$_}=$_;'
> 
> Thanks, but with large files, isn't it better to store not
> the full line, but rather a constant?
> 
>   perl -lne 'print $_ if ! defined $seen{$_}; $seen{$_}=1'

Good point!  I hadn't given it much thought since it usually runs so
quickly in my usage that I never worried about it.

> (actually, using "1" could be seen as misleading, since 0 or even undef
> would also work)
> 
> I think you can drop the "l".
> I have a slight preference for this:
> 
>   perl -ne 'defined $seen{$_} or print; $seen{$_}=1'

Referring to "print" vs. "print $_" here: I have never liked implicit
use of $_, so I tend to avoid it.  At one time there was a push in the
perl community to make all uses explicit.  And whether to use the
'if (expr) { stmt }', 'stmt if expr', or 'expr or stmt' form is a
matter of taste.  Might as well discuss the one true indentation and
brace styles.  :-)  For one-liners I do tend to use short variables
to keep the line length minimized.  In order to compact a line I also
sacrifice whitespace when required.

But you have me thinking about conserving memory.  If the file was
large due to long lines then memory use would be proportionately large
due to the key storage needs.  This could be reduced by using a hash
of the line as the storage key instead of the entire line.  But the
savings would be relative to the average line size.  If the average
line size was smaller than the hash size then this would increase
memory use.

  perl -MDigest::MD5=md5 -lne '$m=md5($_); print $_ if ! defined $a{$m}; $a{$m}=1'

If you are ever going to debug and print out the md5 value then
substitute md5_hex for md5 to get a printable result.

Bob




Information forwarded to bug-coreutils <at> gnu.org:
bug#10287; Package coreutils. (Tue, 13 Dec 2011 18:12:01 GMT)

Message #25 received at submit <at> debbugs.gnu.org:

From: Bob Proulx <bob <at> proulx.com>
To: bug-coreutils <at> gnu.org
Subject: Re: bug#10287: [wishlist] uniq can remove non adjacent lines
Date: Tue, 13 Dec 2011 11:09:55 -0700

Davide Brini wrote:
> Bob Proulx wrote:
> >   perl -lne 'print $_ if ! defined $a{$_}; $a{$_}=$_;'
> 
> While we're at it, this is the typical awk way to do that:
> 
> awk '!a[$0]++'

I like it!  I will definitely be using that awk idiom in the future.
It is simple and concise.

Bob




Information forwarded to bug-coreutils <at> gnu.org:
bug#10287; Package coreutils. (Wed, 14 Dec 2011 22:33:02 GMT)

Message #28 received at 10287 <at> debbugs.gnu.org:

From: Stéphane Blondon <stephane.blondon <at> gmail.com>
To: 10287 <at> debbugs.gnu.org
Subject: Re: bug#10287: [wishlist] uniq can remove non adjacent lines
Date: Wed, 14 Dec 2011 23:30:12 +0100

2011/12/13 Bob Proulx <bob <at> proulx.com>:
> Davide Brini wrote:
>> Bob Proulx wrote:
>> >   perl -lne 'print $_ if ! defined $a{$_}; $a{$_}=$_;'
>>
>> While we're at it, this is the typical awk way to do that:
>>
>> awk '!a[$0]++'

Many thanks to you and Davide for providing a one-liner solution!
I've modified the awk version so that it works as an alias. I'm
sending it in case someone asks the same question:

Copy and paste the next line into ~/.bash_aliases:
alias uniqall='awk '"'"'! a[$0]++'"'"''

Then you can filter like this:
cat file | ... | uniqall | ...


(tested with bash, version 4.2.20(1)-release under Debian Wheezy)

Thanks and goodbye,
-- 
Stéphane
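
Since bash does not expand aliases in non-interactive scripts by default, a shell function is one possible alternative to the alias above. The following is only a sketch: the name `uniqall` is reused from the message above, and the array name `seen` is arbitrary.

```shell
# Function form of the uniqall alias: usable in scripts as well as at the prompt.
uniqall() {
    # Forward any file arguments to awk; with none, awk reads standard input.
    awk '!seen[$0]++' "$@"
}

# Example: filter a pipeline, keeping the first occurrence of each line.
printf 'x\ny\nx\n' | uniqall
```

The example prints `x` and `y`, one per line. A function also avoids the nested quoting that the alias definition needs.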




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 12 Jan 2012 12:24:03 GMT)

This bug report was last modified 13 years and 245 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.