GNU bug report logs -
#32099
uniq: add hash-based implementation
Hi Assaf,
I like datamash.
And I like avoiding a pipe by using sort's "-u" option.
And I like your benchmarks.
Mine are humbly attached.
They compare sorting to hashing.
At the moment, hashing seems to me to be about 70%
faster, and to scale better.
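Roughly, the comparison boils down to something
like this (assuming GNU coreutils and gawk;
"input.txt" is just a placeholder):

    $ seq 5000000 | shuf > input.txt
    $ time sort -u input.txt > /dev/null
    $ time gawk '!seen[$0]++' input.txt > /dev/null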
I'd still like uniq to be faster and free from
sort.
Feel free to let me know if I screwed up.
Thanks,
Kingsley
On 07/09/2018 14:53, Assaf Gordon wrote:
> Hello,
>
> On 08/07/18 03:03 PM, Kingsley G. Morse Jr. wrote:
> > The main reason I'm writing is to humbly suggest
> > a performance improvement to the "uniq" command.
> >
> > It currently requires its input to be sorted.
> >
> > Sorting generally takes about O(N*log(N)) time.
> >
> > However, it seems to me that sorting is
> > unnecessary.
> >
> > You can see this is so in the following example
> >
> > $ echo -e "b\na\nb" | awk '!seen[$0]++'
>
> In addition to Paul's reply, here are a couple more practical issues:
>
> 1.
> GNU sort has several non-trivial (and not obvious) advantages:
>
> It can sort a file larger than available memory (i.e. you
> can sort a 3GB file on a machine with 1GB of RAM).
>
> It can sort using multiple threads, making sorting faster.
>
> If you have a powerful machine (lots of RAM and lots of CPUs),
> you can sort entirely in memory like so:
>
> sort -S 10GB --parallel=10 -T /foo/bar INPUT > OUTPUT
>
> The "-S" sets the maximum amount of memory sort is allowed to use,
> The "--parallel" sets the number of parallel sorting threads,
> The "-T" points to a non-existing directory - ensuring that
> if sort runs out of memory (input too large) then it will fail
> instead of resorting to using disk storage (and slower I/O).
>
> There are many other combinations (e.g. "--compress-program").
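> For example, something like the following makes sort compress its
> temporary spill files with gzip (any program that compresses stdin
> to stdout and decompresses with "-d" works here):
>
> sort -S 1G -T /tmp --compress-program=gzip INPUT > OUTPUT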
>
> A simple hash implementation will not be able to do so.
>
>
> 2.
> Sort already supports printing only unique values with the "-u" option.
> So by using the correct combination of keys (-k) and unique (-u)
> you can get unique values without even invoking "uniq"
> (if your concern is starting another process).
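> For example, to print just one line for each distinct value of the
> first field:
>
> sort -k1,1 -u INPUT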
>
> Note that uniq will compare the entire line, while sort will "unique"
> only the specified "keys"; sort also has the "--stable"
> option, which can affect the results.
>
>
> 3.
> Sometimes you really just want to see the list of unique values,
> and that's valid.
> But often you want to later examine or do something with the list
> of unique values, and then it is common to sort it.
>
> A hash implementation of "unique" will not print sorted items,
> and then you'll likely need to pipe it to another "sort" anyhow
> (perhaps with a much smaller number of items, but sort is still needed).
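> E.g. something like (with the awk one-liner standing in for a
> hash-based "uniq"):
>
> awk '!seen[$0]++' INPUT | sort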
>
>
> 4.
> The Unix philosophy often says
> "Write programs that do one thing and do it well."
> ( https://en.wikipedia.org/wiki/Unix_philosophy )
>
> GNU Sort does sorting very well.
> Other programs that require sorted input can rely on it (e.g. join,
> uniq, etc.).
>
>
> 5.
> Using your awk example is actually a fantastic way to achieve
> what you wanted - it fits perfectly with "do one thing and do it well".
>
> Note that if you're using a recent Debian or Ubuntu machine,
> they have switched the default awk implementation from GNU awk (gawk)
> to "mawk". "mawk" is indeed faster in some respects, but gawk seems to
> be much faster when it comes to hashing.
>
> Observe the following:
>
> $ seq 1000000 | time -p gawk '!seen[$0]++' > /dev/null
> real 0.40
> user 0.35
> sys 0.04
> $ seq 1000000 | time -p mawk '!seen[$0]++' > /dev/null
> real 10.48
> user 10.40
> sys 0.07
>
> Using awk will also enable you to later extend your task to
> achieve more. E.g. the following program:
> awk 'NR==FNR{seen[$1]++;next} seen[$1]>0' a.txt b.txt
>
> will only print lines from "b.txt" which have a key that appears in
> file "a.txt" - a kind of hash-based join between two files.
>
>
> 6.
> Lastly, if you still want a program that uses hashing to discard
> duplicates in a file, may I humbly suggest GNU datamash:
> https://www.gnu.org/software/datamash/
> (disclaimer: I'm datamash's developer).
>
> It can easily remove duplicates, and it uses the same hashing code
> that other coreutils programs use. Example:
>
> $ printf "%s\n" a a b a b c b | datamash rmdup 1
> a
> b
> c
>
> Datamash has several additional useful features,
> for example, it can remove duplicates from a specific column but still print
> the entire matching line:
>
> $ printf "FOO %s %s\n" a 1 a 2 b 3 a 4 b 5 c 6 b 7 \
> | datamash -W rmdup 2
> FOO a 1
> FOO b 3
> FOO c 6
>
>
>
> Hope this helps,
> regards,
> - assaf
>
--
Time is the fire in which we all burn.
[hash_v_sort_benchmarks.1.png (image/png, attachment)]
[hash_v_sort_benchmarks.2.png (image/png, attachment)]
[uniq_demo (text/plain, attachment)]
[hash_and_sort_benchmarks.gnumeric (application/gnumeric, attachment)]