GNU bug report logs -
#10287
[wishlist] uniq can remove non adjacent lines
Report forwarded to bug-coreutils <at> gnu.org: bug#10287; Package coreutils. (Tue, 13 Dec 2011 02:52:02 GMT)
Acknowledgement sent to Stéphane Blondon <stephane.blondon <at> gmail.com>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Tue, 13 Dec 2011 02:52:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Tool: uniq
Priority: wishlist
Hello,
I think `uniq` should have an additional option (for example -a,
--all) to remove duplicate lines even when they are not adjacent.
The man page explains a workaround based on `sort`, but it can be
complex to use. A few weeks ago, I had to `uniq`-ize random numbers and
sorting them couldn't really work. Fortunately, the order was not
important, so using `sort | uniq | sort --random-sort` was an
acceptable solution. I imagine cases based on other tools like `top`
could be a problem too.
If you are interested, I could try to provide a patch. (I have learnt
C but I don't use it today.)
I don't think the increase in memory use is a problem today, so a
warning in the man page should be enough.
Thanks for everything,
--
Stéphane
Information forwarded to bug-coreutils <at> gnu.org: bug#10287; Package coreutils. (Tue, 13 Dec 2011 04:22:01 GMT)
Message #8 received at 10287 <at> debbugs.gnu.org (full text, mbox):
Stéphane Blondon wrote:
> I think `uniq` should have an additional option (for example -a,
> --all) to remove duplicate lines even when they are not adjacent.
>
> The man page explains a workaround based on `sort`, but it can be
> complex to use. A few weeks ago, I had to `uniq`-ize random numbers and
> sorting them couldn't really work. Fortunately, the order was not
> important, so using `sort | uniq | sort --random-sort` was an
> acceptable solution. I imagine cases based on other tools like `top`
> could be a problem too.
If you want to print only the first occurrence of each line, then this perl
one-liner will do it.
perl -lne 'print $_ if ! defined $a{$_}; $a{$_}=$_;'
Bob
Information forwarded to bug-coreutils <at> gnu.org: bug#10287; Package coreutils. (Tue, 13 Dec 2011 08:32:02 GMT)
Message #11 received at 10287 <at> debbugs.gnu.org (full text, mbox):
On Mon, 12 Dec 2011 21:20:18 -0700, Bob Proulx <bob <at> proulx.com> wrote:
> Stéphane Blondon wrote:
> > I think `uniq` should have an additional option (for example -a,
> > --all) to remove duplicate lines even when they are not adjacent.
> >
> > The man page explains a workaround based on `sort`, but it can be
> > complex to use. A few weeks ago, I had to `uniq`-ize random numbers and
> > sorting them couldn't really work. Fortunately, the order was not
> > important, so using `sort | uniq | sort --random-sort` was an
> > acceptable solution. I imagine cases based on other tools like `top`
> > could be a problem too.
>
> If you want to print only the first occurrence of each line, then this perl
> one-liner will do it.
>
> perl -lne 'print $_ if ! defined $a{$_}; $a{$_}=$_;'
While we're at it, this is the typical awk way to do that:
awk '!a[$0]++'
--
D.
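For readers new to the idiom, `!a[$0]++` relies on three awk behaviours: an unset array element evaluates to 0, the post-increment yields the old value, and a pattern without an action prints the current line. The following is only a sketch of an equivalent long form; the function name `dedup_first` and the array name `seen` are illustrative and do not come from the thread.

```shell
# Long form of awk '!a[$0]++': print each line only the first time it appears.
dedup_first() {
  awk '{
    if (!($0 in seen)) {   # first time this exact line is encountered
      print                # emit it, preserving input order
    }
    seen[$0] = 1           # remember the line; the stored value is irrelevant
  }' "$@"
}
```

For example, `printf 'b\na\nb\n' | dedup_first` prints `b` and then `a`. Like the one-liner, it keeps every distinct line in memory.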
Reply sent to Pádraig Brady <P <at> draigBrady.com>: You have taken responsibility. (Tue, 13 Dec 2011 08:47:02 GMT)
Notification sent to Stéphane Blondon <stephane.blondon <at> gmail.com>: bug acknowledged by developer. (Tue, 13 Dec 2011 08:47:02 GMT)
Message #16 received at 10287-done <at> debbugs.gnu.org (full text, mbox):
On 12/12/2011 10:54 PM, Stéphane Blondon wrote:
> Tool: uniq
> Priority: wishlist
>
> Hello,
>
> I think `uniq` should have an additional option (for example -a,
> --all) to remove duplicate lines even when they are not adjacent.
>
> The man page explains a workaround based on `sort`, but it can be
> complex to use. A few weeks ago, I had to `uniq`-ize random numbers and
> sorting them couldn't really work. Fortunately, the order was not
> important, so using `sort | uniq | sort --random-sort` was an
> acceptable solution. I imagine cases based on other tools like `top`
> could be a problem too.
>
> If you are interested, I could try to provide a patch. (I have learnt
> C but I don't use it today.)
>
> I don't think the increase in memory use is a problem today, so a
> warning in the man page should be enough.
Well, that would increase the complexity of `uniq` a _lot_:
http://lists.gnu.org/archive/html/coreutils/2011-11/msg00018.html
For that reason I would be against adding such a feature.
Note that improving the field selection of `uniq` is appropriate,
and would make DSU (decorate-sort-undecorate) solutions using sort easier to implement.
cheers,
Pádraig.
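As an illustration of the DSU approach mentioned above, here is a rough sketch that removes non-adjacent duplicates while keeping first occurrences in their original order, using the field skipping `uniq` already has (`-f`). The function name `dedup_dsu` is made up for this example, and awk is used only to number the lines. It assumes a `sort` that accepts a tab as the `-t` separator, and it forces the C locale so that `sort` and `uniq` agree on which lines are equal.

```shell
# Decorate-sort-undecorate: drop non-adjacent duplicates, keep first occurrences in order.
dedup_dsu() {
  tab=$(printf '\t')
  # LC_ALL=C makes sort and uniq compare raw bytes, so they agree on equality.
  awk '{ print NR "\t" $0 }' |                 # decorate: "<line number><TAB><line>"
    LC_ALL=C sort -t "$tab" -k2 -k1,1n |       # group identical lines, earliest first
    LC_ALL=C uniq -f 1 |                       # compare while skipping the number field
    LC_ALL=C sort -t "$tab" -k1,1n |           # restore original input order
    cut -f 2-                                  # undecorate: strip the line number
}
```

For example, `printf 'b\na\nb\nc\na\n' | dedup_dsu` prints `b`, `a`, `c`. The trade-off against the awk and perl one-liners is two sorts of the whole input in exchange for not holding a table of every distinct line in one process.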
Information forwarded to bug-coreutils <at> gnu.org: bug#10287; Package coreutils. (Tue, 13 Dec 2011 08:48:01 GMT)
Message #19 received at 10287 <at> debbugs.gnu.org (full text, mbox):
Bob Proulx wrote:
> Stéphane Blondon wrote:
>> I think `uniq` should have an additional option (for example -a,
>> --all) to remove duplicate lines even when they are not adjacent.
>>
>> The man page explains a workaround based on `sort`, but it can be
>> complex to use. A few weeks ago, I had to `uniq`-ize random numbers and
>> sorting them couldn't really work. Fortunately, the order was not
>> important, so using `sort | uniq | sort --random-sort` was an
>> acceptable solution. I imagine cases based on other tools like `top`
>> could be a problem too.
>
> If you want to print only the first occurrence of each line, then this perl
> one-liner will do it.
>
> perl -lne 'print $_ if ! defined $a{$_}; $a{$_}=$_;'
Thanks, but with large files, isn't it better to store not
the full line, but rather a constant?
perl -lne 'print $_ if ! defined $seen{$_}; $seen{$_}=1'
(actually, using "1" could be seen as misleading, since 0 would also work,
and even undef would if the test used "exists" instead of "defined")
I think you can drop the "l".
I have a slight preference for this:
perl -ne 'defined $seen{$_} or print; $seen{$_}=1'
Information forwarded to bug-coreutils <at> gnu.org: bug#10287; Package coreutils. (Tue, 13 Dec 2011 18:09:01 GMT)
Message #22 received at 10287 <at> debbugs.gnu.org (full text, mbox):
Jim Meyering wrote:
> Bob Proulx wrote:
> > If you want to print only the first occurrence of each line, then this perl
> > one-liner will do it.
> >
> > perl -lne 'print $_ if ! defined $a{$_}; $a{$_}=$_;'
>
> Thanks, but with large files, isn't it better to store not
> the full line, but rather a constant?
>
> perl -lne 'print $_ if ! defined $seen{$_}; $seen{$_}=1'
Good point! I hadn't given it much thought since it usually runs so
quickly in my usage that I never worried about it.
> (actually, using "1" could be seen as misleading, since 0 would also work,
> and even undef would if the test used "exists" instead of "defined")
>
> I think you can drop the "l".
> I have a slight preference for this:
>
> perl -ne 'defined $seen{$_} or print; $seen{$_}=1'
Referring to "print" vs. "print $_" here: I have never liked implicit use
of $_ and so I tend to avoid it. At one time there was a push in the
perl community to make all uses explicit. And as to whether to use
the 'if (expr) { stmt }' or 'stmt if expr' or 'expr or stmt' forms is
a matter of taste. Might as well discuss the one true indention and
brace styles. :-) For one-liners I do tend to use short variables
to keep the line length minimized. In order to compact a line I also
sacrifice whitespace when required.
But you have me thinking about conserving memory. If the file was
large due to long lines then memory use would be proportionately large
due to the key storage needs. This could be reduced by using a hash
of the line as the storage key instead of the entire line. But the
savings would be relative to the average line size. If the average
line size was smaller than the hash size then this would increase
memory use.
perl -MDigest::MD5=md5 -lne '$m=md5($_); print $_ if ! defined $a{$m}; $a{$m}=1'
If you are ever going to debug and print out the md5 value then
substitute md5_hex for md5 to get a printable result.
Bob
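To make the trade-off above concrete: a raw MD5 digest is 16 bytes, so keying the table on the digest can only save memory when lines average more than 16 bytes, and it accepts the theoretical risk that two different lines collide and the later one is dropped. Below is a compact sketch of the same idea wrapped in a shell function. The name `dedup_md5` and the hash name `seen` are illustrative, and it assumes a perl installation that provides Digest::MD5.

```shell
# Order-preserving dedup keyed on the 16-byte binary MD5 digest of each line.
# Assumption: perl with the Digest::MD5 module is installed.
dedup_md5() {
  perl -MDigest::MD5=md5 -ne '
    # The post-increment is false only the first time a digest is met, so print then.
    $seen{ md5($_) }++ or print;
  ' "$@"
}
```

It is used like the other filters, for example `dedup_md5 < file`. Because `-l` is omitted, a final line without a trailing newline is treated as different from the same text followed by a newline.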
Information forwarded to bug-coreutils <at> gnu.org: bug#10287; Package coreutils. (Tue, 13 Dec 2011 18:12:01 GMT)
Message #25 received at submit <at> debbugs.gnu.org (full text, mbox):
Davide Brini wrote:
> Bob Proulx wrote:
> > perl -lne 'print $_ if ! defined $a{$_}; $a{$_}=$_;'
>
> While we're at it, this is the typical awk way to do that:
>
> awk '!a[$0]++'
I like it! I will definitely be using that awk idiom in the future.
It is simple and concise.
Bob
Information forwarded to bug-coreutils <at> gnu.org: bug#10287; Package coreutils. (Wed, 14 Dec 2011 22:33:02 GMT)
Message #28 received at 10287 <at> debbugs.gnu.org (full text, mbox):
2011/12/13 Bob Proulx <bob <at> proulx.com>:
> Davide Brini wrote:
>> Bob Proulx wrote:
>> > perl -lne 'print $_ if ! defined $a{$_}; $a{$_}=$_;'
>>
>> While we're at it, this is the typical awk way to do that:
>>
>> awk '!a[$0]++'
Many thanks to you and Davide for providing a one-liner
solution! I've modified the awk version so that it works as an alias.
I am sending it in case someone asks the same question:
Copy and paste the following line into ~/.bash_aliases:
alias uniqall='awk '"'"'! a[$0]++'"'"''
Then you can filter like this:
cat file | ... | uniqall | ...
(tested with bash, version 4.2.20(1)-release under Debian Wheezy)
Thanks and goodbye,
--
Stéphane
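The nested `'"'"'` quoting in the alias is needed only because the alias body is itself single-quoted. A shell function sidesteps that and can also take file names as arguments. This is a minimal sketch, not something posted in the thread; the name `uniqall` is kept from the message above and the array name `seen` is arbitrary.

```shell
# Function form of the uniqall alias: no nested quoting, and file operands are passed to awk.
uniqall() {
  awk '!seen[$0]++' "$@"
}
```

It works in a pipeline exactly like the alias (`cat file | ... | uniqall | ...`) and also directly on a file (`uniqall file`).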
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 12 Jan 2012 12:24:03 GMT)
This bug report was last modified 13 years and 245 days ago.