GNU bug report logs - #10287
[wishlist] uniq can remove non adjacent lines


Package: coreutils;

Reported by: Stéphane Blondon <stephane.blondon <at> gmail.com>

Date: Tue, 13 Dec 2011 02:52:01 UTC

Severity: wishlist

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.



Message #22 received at 10287 <at> debbugs.gnu.org (full text, mbox):

From: Bob Proulx <bob <at> proulx.com>
To: 10287 <at> debbugs.gnu.org,
	Stéphane Blondon <stephane.blondon <at> gmail.com>
Subject: Re: bug#10287: [wishlist] uniq can remove non adjacent lines
Date: Tue, 13 Dec 2011 11:06:50 -0700

Jim Meyering wrote:
> Bob Proulx wrote:
> > If you want to print only the first of a unique line then this perl
> > one-liner will do it.
> >
> >   perl -lne 'print $_ if ! defined $a{$_}; $a{$_}=$_;'
> 
> Thanks, but with large files, isn't it better to store not
> the full line, but rather a constant?
> 
>   perl -lne 'print $_ if ! defined $seen{$_}; $seen{$_}=1'

Good point!  I hadn't given it much thought since it usually runs so
quickly in my usage that I never worried about it.

> (actually, using "1" could be seen as misleading, since 0 or even undef
> would also work)
> 
> I think you can drop the "l".
> I have a slight preference for this:
> 
>   perl -ne 'defined $seen{$_} or print; $seen{$_}=1'

Referring to "print" vs. "print $_" here: I have never liked implicit use
of $_ and so I tend to avoid it.  At one time there was a push in the
perl community to make all uses explicit.  And whether to use the
'if (expr) { stmt }' or 'stmt if expr' or 'expr or stmt' form is
a matter of taste.  Might as well discuss the one true indentation and
brace styles.  :-)  For one-liners I do tend to use short variables
to keep the line length minimized.  In order to compact a line I also
sacrifice whitespace when required.

But you have me thinking about conserving memory.  If the file is
large due to long lines then memory use will be proportionately large,
because every distinct line is stored in full as a hash key.  This
could be reduced by using a digest of the line as the key instead of
the entire line.  But the savings depend on the average line size.  A
raw MD5 digest is 16 bytes, so if the average line is shorter than
that, this would increase memory use.

  perl -MDigest::MD5=md5 -lne '$m=md5($_); print $_ if ! defined $a{$m}; $a{$m}=1'

If you are ever going to debug and print out the md5 value then
substitute md5_hex for md5 to get a printable result.

Bob




This bug report was last modified 13 years and 246 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.