GNU bug report logs - #32099
uniq: add hash-based implementation

Previous Next

Package: coreutils;

Reported by: "Kingsley G. Morse Jr." <kingsley <at> loaner.com>

Date: Sun, 8 Jul 2018 22:22:01 UTC

Severity: wishlist

Tags: notabug

Merged with 32101

Full log


Message #8 received at 32099 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: "Kingsley G. Morse Jr." <kingsley <at> loaner.com>, 32099 <at> debbugs.gnu.org
Subject: Re: bug#32099: New uniq option for speed
Date: Mon, 9 Jul 2018 09:35:35 -0700
Kingsley G. Morse Jr. wrote:
>      $ echo -e "b\na\nb" | awk '!seen[$0]++'
> 
> It basically avoids sorting by using hashed
> indexes into an associative array to find
> previously seen values in about O(N) time.

No, it's still O(N log N) because hash table lookup is really O(log N), despite 
what many textbooks say. Though no doubt we could make it faster than than the 
sorting pipeline, it wouldn't be algorithmically faster. See, for example:

https://lemire.me/blog/2009/08/18/do-hash-tables-work-in-constant-time/




This bug report was last modified 6 years and 313 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.