GNU bug report logs -
#32099
uniq: add hash-based implementation
Hi Assaf,
I like datamash.
And I like avoiding a pipe by using sort's "-u" option.
And I like your benchmarks.
Mine are humbly attached.
They compare sorting to hashing.
At the moment, hashing seems to me to be about 70%
faster, and to scale better.
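Roughly, the comparison boils down to something
like this (assuming GNU coreutils and gawk;
"input.txt" is just a placeholder):

    $ seq 5000000 | shuf > input.txt
    $ time sort -u input.txt > /dev/null
    $ time gawk '!seen[$0]++' input.txt > /dev/null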
I'd still like uniq to be faster and free from
sort.
Feel free to let me know if I screwed up.
Thanks,
Kingsley
On 07/09/2018 14:53, Assaf Gordon wrote:
> Hello,
>
> On 08/07/18 03:03 PM, Kingsley G. Morse Jr. wrote:
> > The main reason I'm writing is to humbly suggest
> > a performance improvement to the "uniq" command.
> >
> > It currently requires its input to be sorted.
> >
> > Sorting generally takes about O(N*log(N)) time.
> >
> > However, it seems to me that sorting is
> > unnecessary.
> >
> > You can see this is so in the following example
> >
> > $ echo -e "b\na\nb" | awk '!seen[$0]++'
>
> In addition to Paul's reply, here are a couple more practical issues:
>
> 1.
> GNU sort has several non-trivial (and not obvious) advantages:
>
> It can sort a file larger than available memory (i.e. you
> can sort a 3GB file on a machine with 1GB of RAM).
>
> It can sort using multiple threads, making sorting faster.
>
> If you have a powerful machine (lots of RAM and lots of CPUs),
> you can sort entirely in memory like so:
>
> sort -S 10GB --parallel=10 -T /foo/bar INPUT > OUTPUT
>
> The "-S" sets the maximum amount of memory sort is allowed to use,
> The "--parallel" sets the number of parallel sorting threads,
> The "-T" points to a non-existing directory - ensuring that
> if sort runs out of memory (input too large) then it will fail
> instead of resorting to using disk storage (and slower I/O).
>
> There are many other combinations (e.g. "--compress-program").
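> For example, something like the following makes sort compress its
> temporary spill files with gzip (any program that compresses stdin
> to stdout and decompresses with "-d" works here):
>
> sort -S 1G -T /tmp --compress-program=gzip INPUT > OUTPUT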
>
> A simple hash implementation will not be able to do so.
>
>
> 2.
> Sort already supports printing only unique values with the "-u" option.
> So by using the correct combination of keys (-k) and unique (-u)
> you can get unique values without even invoking "uniq"
> (if your concern is starting another process).
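> For example, to print just one line for each distinct value of the
> first field:
>
> sort -k1,1 -u INPUT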
>
> Note that uniq will compare the entire line, while sort will "unique"
> only the specified "keys"; sort also has the "--stable"
> option, which can affect the results.
>
>
> 3.
> Sometimes you really just want to see the list of unique values,
> and that's valid.
> But often you want to later examine or do something with the list
> of unique values, and then it is common to sort it.
>
> A hash implementation of "unique" will not print sorted items,
> and then you'll likely need to pipe it to another "sort" anyhow
> (perhaps with a much smaller number of items, but sort is still needed).
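> E.g. something like (with the awk one-liner standing in for a
> hash-based "uniq"):
>
> awk '!seen[$0]++' INPUT | sort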
>
>
> 4.
> The Unix philosophy often says
> "Write programs that do one thing and do it well."
> ( https://en.wikipedia.org/wiki/Unix_philosophy )
>
> GNU Sort does sorting very well.
> Other programs that require sorted input can rely on it (e.g. join,
> uniq, etc.).
>
>
> 5.
> Using your awk example is actually a fantastic way to achieve
> what you wanted - it fits perfectly with "do one thing and do it well".
>
> Note that if you're using a recent Debian or Ubuntu machine,
> they have switched the default awk implementation from GNU awk (gawk)
> to "mawk". "mawk" is indeed faster in some respects, but gawk seems to
> be much faster when it comes to hashing.
>
> Observe the following:
>
> $ seq 1000000 | time -p gawk '!seen[$0]++' > /dev/null
> real 0.40
> user 0.35
> sys 0.04
> $ seq 1000000 | time -p mawk '!seen[$0]++' > /dev/null
> real 10.48
> user 10.40
> sys 0.07
>
> Using awk will also enable you to later extend your task to
> achieve more. E.g. the following program:
> awk 'NR==FNR{seen[$1]++;next} seen[$1]>0' a.txt b.txt
>
> will only print lines from "b.txt" which have a key that appears in
> file "a.txt" - a kind of hash-based join between two files.
>
>
> 6.
> Lastly, if you still want a program that uses hashing to discard
> duplicates in a file, may I humbly suggest GNU datamash:
> https://www.gnu.org/software/datamash/
> (disclaimer: I'm datamash's developer).
>
> It can easily remove duplicates, and it uses the same hashing code
> that other coreutils programs use. Example:
>
> $ printf "%s\n" a a b a b c b | datamash rmdup 1
> a
> b
> c
>
> Datamash has several additional useful features,
> for example, it can remove duplicates from a specific column but still print
> the entire matching line:
>
> $ printf "FOO %s %s\n" a 1 a 2 b 3 a 4 b 5 c 6 b 7 \
> | datamash -W rmdup 2
> FOO a 1
> FOO b 3
> FOO c 6
>
>
>
> Hope this helps,
> regards,
> - assaf
>
--
Time is the fire in which we all burn.
[hash_v_sort_benchmarks.1.png (image/png, attachment)]
[hash_v_sort_benchmarks.2.png (image/png, attachment)]
[uniq_demo (text/plain, attachment)]
[hash_and_sort_benchmarks.gnumeric (application/gnumeric, attachment)]