GNU bug report logs - #32099
uniq: add hash-based implementation

Previous Next

Package: coreutils;

Reported by: "Kingsley G. Morse Jr." <kingsley <at> loaner.com>

Date: Sun, 8 Jul 2018 22:22:01 UTC

Severity: wishlist

Tags: notabug

Merged with 32101

Full log


Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: "Kingsley G. Morse Jr." <kingsley <at> loaner.com>
To: bug-coreutils <at> gnu.org
Subject: New uniq option for speed
Date: Sun, 8 Jul 2018 14:03:56 -0700
Thank you very much for maintaining coreutils!

It seems to me that it would be hard to
underestimate how useful it is.


The main reason I'm writing is to humbly suggest
a performance improvement to the "uniq" command.

It currently requires its input to be sorted.

Sorting generally takes about O(N*log(N)) time.

However, it seems to me that sorting is
unnecessary.

You can see this is so in the following example

    $ echo -e "b\na\nb" | awk '!seen[$0]++'

It basically avoids sorting by using hashed
indexes into an associative array to find
previously seen values in about O(N) time.

I expect it would be about O(log(N)) times faster
than the sorting the uniq command currently
requires.

For big values of N, the time saved could be
substantial.


Humble suggestion.

Add an option to the "uniq" command.

Let it work quickly with unsorted data.

Maybe the new option could be something like
"-usi" for "unsorted input".

Thanks,
Kingsley

-- 
Time is the fire in which we all burn.





This bug report was last modified 6 years and 312 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.