GNU bug report logs - #32099
uniq: add hash-based implementation
Report forwarded to bug-coreutils <at> gnu.org: bug#32099; Package coreutils. (Sun, 08 Jul 2018 22:22:01 GMT)
Acknowledgement sent to "Kingsley G. Morse Jr." <kingsley <at> loaner.com>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Sun, 08 Jul 2018 22:22:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Thank you very much for maintaining coreutils!
It seems to me that it would be hard to
overstate how useful it is.
The main reason I'm writing is to humbly suggest
a performance improvement to the "uniq" command.
It currently requires its input to be sorted.
Sorting generally takes about O(N*log(N)) time.
However, it seems to me that sorting is
unnecessary.
You can see this is so in the following example
$ echo -e "b\na\nb" | awk '!seen[$0]++'
It basically avoids sorting by using hashed
indexes into an associative array to find
previously seen values in about O(N) time.
I expect it would be about O(log(N)) times faster
than the sorting that the uniq command currently
requires.
For big values of N, the time saved could be
substantial.
Humble suggestion.
Add an option to the "uniq" command.
Let it work quickly with unsorted data.
Maybe the new option could be something like
"-usi" for "unsorted input".
Thanks,
Kingsley
--
Time is the fire in which we all burn.
Information forwarded to bug-coreutils <at> gnu.org: bug#32099; Package coreutils. (Mon, 09 Jul 2018 16:36:02 GMT)
Message #8 received at 32099 <at> debbugs.gnu.org (full text, mbox):
Kingsley G. Morse Jr. wrote:
> $ echo -e "b\na\nb" | awk '!seen[$0]++'
>
> It basically avoids sorting by using hashed
> indexes into an associative array to find
> previously seen values in about O(N) time.
No, it's still O(N log N) because hash table lookup is really O(log N), despite
what many textbooks say. Though no doubt we could make it faster than the
sorting pipeline, it wouldn't be algorithmically faster. See, for example:
https://lemire.me/blog/2009/08/18/do-hash-tables-work-in-constant-time/
Information forwarded to bug-coreutils <at> gnu.org: bug#32099; Package coreutils. (Mon, 09 Jul 2018 20:54:02 GMT)
Message #11 received at 32099 <at> debbugs.gnu.org (full text, mbox):
Hello,
On 08/07/18 03:03 PM, Kingsley G. Morse Jr. wrote:
> The main reason I'm writing is to humbly suggest
> a performance improvement to the "uniq" command.
>
> It currently requires its input to be sorted.
>
> Sorting generally takes about O(N*log(N)) time.
>
> However, it seems to me that sorting is
> unnecessary.
>
> You can see this is so in the following example
>
> $ echo -e "b\na\nb" | awk '!seen[$0]++'
In addition to Paul's reply, here are a couple more practical issues:
1.
GNU sort has several non trivial (and not obvious) advantages:
It can sort a file larger than available memory (i.e. you
can sort a 3GB file on a machine with 1GB of RAM).
It can sort using multiple threads, making sorting faster.
If you have a powerful machine (lots of ram and lots of CPUs),
you can sort entirely in memory like so:
sort -S 10GB --parallel=10 -T /foo/bar INPUT > OUTPUT
The "-S" sets the maximum amount of memory sort is allowed to use,
The "--parallel" sets the number of parallel sorting threads,
The "-T" points to a non-existing directory - ensuring that
if sort runs out of memory (input too large) then it will fail
instead of resorting to using disk storage (and slower I/O).
There are many other combinations (e.g. "--compress-program").
A simple hash implementation will not be able to do so.
2.
Sort already supports printing only unique values with the "-u" option.
So by using the correct combination of keys (-k) and unique (-u)
you can get unique values without even invoking "uniq"
(if your concern is starting another process).
Note that uniq will compare the entire line, and sort will "unique"
only the specified "keys", but sort also has last the "--stable"
option that can affect the results.
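A small illustration of that difference (which line from an equal-key
group survives can depend on "--stable"):
$ printf '%s\n' 'a 1' 'b 2' 'a 1' | sort | uniq      # compares whole lines
a 1
b 2
$ printf '%s\n' 'a 1' 'b 2' 'a 3' | sort -k1,1 -u    # unique on the key only
a 1
b 2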
3.
Sometimes you really just want to see the list of unique values,
and that's valid.
But oftentimes you want to later examine or do something with the list
of unique values, and then it is common to sort it.
A hash implementation of "unique" will not print sorted items,
so you'll likely need to pipe it to another "sort" anyhow
(perhaps with a much smaller number of items, but sort is still needed).
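For example (just a sketch, with placeholder file names), de-duplicating
first and then sorting the usually much smaller result:
$ awk '!seen[$0]++' input.txt | sort > unique_sorted.txt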
4.
The Unix philosophy often says
"Write programs that do one thing and do it well."
( https://en.wikipedia.org/wiki/Unix_philosophy )
GNU Sort does sorting very well.
Other programs that require sorted input can rely on it (e.g. join,
uniq, etc.).
5.
Using your awk example is actually a fantastic way to achieve
what you wanted - it fits perfectly in "do one thing and do it well".
Note that if you're using a recent Debian or Ubuntu machine,
they have switched the default awk implementation from GNU awk (gawk)
to "mawk". "mawk" is indeed faster in some aspects, but it seems gawk is
much faster when it comes to hashing.
Observe the following:
$ seq 1000000 | time -p gawk '!seen[$0]++' > /dev/null
real 0.40
user 0.35
sys 0.04
$ seq 1000000 | time -p mawk '!seen[$0]++' > /dev/null
real 10.48
user 10.40
sys 0.07
Using awk will also enable you to later extend your task to
achieve more. E.g. the following program:
awk 'NR==FNR{seen[$1]++;next} seen[$1]>0' a.txt b.txt
Will only print lines from "b.txt" which have a key matching from
file "a.txt". kind of a hash-based join between two files.
6.
Lastly, if you still want a program that uses hashing to discard
duplicates in a file, may I humbly suggest GNU datamash:
https://www.gnu.org/software/datamash/
(disclaimer: I'm datamash's developer).
It can easily remove duplicates, and it uses the same hashing code
that other coreutils programs use. Example:
$ printf "%s\n" a a b a b c b | datamash rmdup 1
a
b
c
Datamash has several additional useful features,
for example it can remove duplicates from a specific column but still
print the entire matching line:
$ printf "FOO %s %s\n" a 1 a 2 b 3 a 4 b 5 c 6 b 7 \
| datamash -W rmdup 2
FOO a 1
FOO b 3
FOO c 6
Hope this helps,
regards,
- assaf
Information forwarded to bug-coreutils <at> gnu.org: bug#32099; Package coreutils. (Tue, 10 Jul 2018 04:31:02 GMT)
Message #14 received at 32099 <at> debbugs.gnu.org (full text, mbox):
Hi Assaf,
I like datamash.
And I like avoiding a pipe with sort's "-u"
option.
And I like your benchmarks.
Mine are humbly attached.
They compare sorting to hashing.
At the moment, hashing seems to me to be about 70%
faster.
And to scale better.
I'd still like uniq to be faster and free from
sort.
Feel free to let me know if I screwed up.
Thanks,
Kingsley
On 07/09/2018 14:53, Assaf Gordon wrote:
> [...]
--
Time is the fire in which we all burn.
[hash_v_sort_benchmarks.1.png (image/png, attachment)]
[hash_v_sort_benchmarks.2.png (image/png, attachment)]
[uniq_demo (text/plain, attachment)]
[hash_and_sort_benchmarks.gnumeric (application/gnumeric, attachment)]
Information forwarded to bug-coreutils <at> gnu.org: bug#32099; Package coreutils. (Tue, 10 Jul 2018 18:22:01 GMT)
Message #17 received at 32099 <at> debbugs.gnu.org (full text, mbox):
tag 32099 notabug
severity 32099 wishlist
close 32099
stop
Hello Kingsley,
On 09/07/18 10:29 PM, Kingsley G. Morse Jr. wrote:
> And I like your benchmarks.
>
> Mine are humbly attached.
>
> They compare sorting to hashing.
>
> At the moment, hashing seems to me to be about 70%
> faster.
>
> And to scale better.
>
> I'd still like uniq to be faster and free from
> sort.
Thank you for taking the time to create and provide these,
including the scripts to reproduce them.
I've marked this item as "closed / wishlist" - but only because it
is technically not a bug report. Discussion is welcome to continue
by replying to this thread.
Being "faster" alone is important but not the only criteria (as
mentioned in the previous email's examples).
Your measurements are a good start towards stronger support to
hash-based uniq implementation (a working patch, btw, would be even
stronger).
I would suggest several improvements to the measurements, to ensure
relevant information is conveyed:
1.
Specify exactly which versions of awk/sort you are using
(e.g. which versions, whether from your distribution or compiled from
source, and if compiled, which compiler / optimization flags).
What CPU is being used, how much memory does the computer have, etc.
2.
I would suggest adding "sort | uniq" to the charts, because that is the
real base-line you are trying to improve.
I would also consider adding "datamash rmdup" to the charts, because its
hash-based implementation could serve as a ball-park indication of how
a naive hash implementation would work in uniq.
3.
Your input is very small (up to 90K lines of 10 characters each) - that
does not seem to stress-test any significant parts of the algorithms
(and for a typical user, waiting 0.05 seconds vs 0.15 seconds does not
matter).
I would recommend much larger input files, with much longer lines -
trying to simulate various real-world scenarios.
Here's a project of someone who tried to compare various hashing
functions: https://github.com/jhunt/hash/
Under the "corpus" directory you'll find few demo data files.
Even the largest one (qnames) is only 84MB - not very big.
Perhaps consider duplicating the content a few times, then test it:
for i in $(seq 1 5);do cat qnames; done > input
To get even closer to real-world input, try several input files
with different ratios of duplicated lines (your script creates completely
random data - not very similar to real-world input files).
For example, what happens when many lines share similar sub-strings?
Depending on the hashing implementation, this could lead to lower
performance.
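One possible way to generate inputs with a controlled duplication ratio
(a sketch using GNU shuf; the sizes are arbitrary) is to draw lines with
repetition from a pool of distinct keys - the pool size sets the ratio:
# pool of 1M distinct keys, 50M output lines => ~98% duplicate lines;
# enlarge the pool (the seq limit) to get fewer duplicates
$ seq 1000000 | shuf -r -n 50000000 > input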
4.
It would be beneficial to see what the memory requirements and limits
of a hash-based implementation are. Currently, because 'sort' is not limited
by available RAM, there is no memory limit (though resorting to disk I/O
might make everything slow).
For example, is it possible to hash an 800MB file on a machine with only
1GB of RAM? Without additional code, hashing files larger than
available memory is simply not possible - a very big disadvantage.
5.
To compare "sort' fairly against other options,
I would add to the measurements using "--parallel 1" and later
"--parallel N" (N being the number of cpu/cores you have).
Does that improve performance?
The "awk" (or datamash) programs do not have a built-in memory limit,
while sort does. It would be useful to specify a large amount of
memory for sort (using "-S") to ensure sort did not revert to disk I/O.
6.
The reported time in your script is the elapsed time (aka "wall time").
A more relevant option would be "user time" + "kernel time"
(optiosn %S and %U in "time -f") - the wall time can be affected by
factors that don't immediately relate to hashing/sorting performance.
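For example, with GNU time (the standalone /usr/bin/time, not the shell
built-in; "input" is a placeholder file name):
$ /usr/bin/time -f 'user=%U sys=%S elapsed=%e' sort -u input > /dev/null
$ /usr/bin/time -f 'user=%U sys=%S elapsed=%e' gawk '!seen[$0]++' input > /dev/null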
7.
In your script you are providing file-based input;
it might be useful to clear the kernel's cache before each run
to ensure cached data does not affect the results (especially when
dealing with large files).
See:
https://www.tecmint.com/clear-ram-memory-cache-buffer-and-swap-space-on-linux/
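For example, on Linux (requires root):
$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null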
To conclude,
There is definitely merit in trying to optimize uniq's performance,
but it must be weighed against other factors (e.g. the ability to 'uniq'
a file larger than available memory, and the ease of using existing
programs to achieve the same effect).
IMHO it is not a top-priority, so I don't want to give the impression
that if an amazing speed improvement is presented, uniq will be swiftly
rewritten.
But of course, a working patch and strong pro arguments would go a long
way towards incorporating this feature.
regards,
- assaf
Information forwarded to bug-coreutils <at> gnu.org: bug#32099; Package coreutils. (Wed, 11 Jul 2018 23:44:01 GMT)
Message #20 received at 32099 <at> debbugs.gnu.org (full text, mbox):
Hey Assaf,
Thank you very much for taking the time to share
your thoughts.
People like you, willing to improve free software
and help the common good, restore my faith in
humanity.
I followed much of your advice.
I ran more benchmarks:
gawk, mawk, sort -u, sort | uniq and ... datamash rmdup!
Plus, I benchmarked them against a bigger file.
~1 GB!
Plus, plus, I measured how much memory they used!
The result?
Hashing algorithms always won.
I expected it.
When I benchmarked data compression apps, one
named "rzip", which searched its cache of unique
values with hashing, was much faster.
https://www.linuxjournal.com/article/8051
But, to be fair, I was surprised by how little
memory datamash seemed to use.
Maybe I screwed up.
I checked the max resident set size of the process
during its lifetime with /usr/bin/time's "%M"
specifier.
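(For reference, a sketch of that kind of measurement, with a placeholder
file name:)
$ /usr/bin/time -f 'elapsed=%e max_rss_kb=%M' sort -u input > /dev/null
$ /usr/bin/time -f 'elapsed=%e max_rss_kb=%M' datamash rmdup 1 < input > /dev/null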
Plus, plus, plus, I created a composite
performance criterion of
elapsed time * memory used
(It's in the attached spread sheet.)
That winner at ~1 GB?
datamash!
No matter whether the keys were unique or
duplicated, "datamash rmdup 1" won!
New files are attached.
Humble suggestion:
Don't worry about running out of memory.
Hashing could be added to the "uniq" command as...
(wait for it...) an option.
So you could always omit it and use sort.
For yours truly, the big gain is actually being
able to ditch sort.
Less is more.
It's a UNIX thing.
So it is said.
So it is done.
Sorting is great, but
less is number one!
~K
PS: Congratulations on datamash performing so
well.
On 07/10/2018 12:21, Assaf Gordon wrote:
> [...]
--
Time is the fire in which we all burn.
[uniq_demo (text/plain, attachment)]
[hash_and_sort_benchmarks.2.gnumeric (application/gnumeric, attachment)]
[uniq_demo.log.2 (text/plain, attachment)]
Information forwarded to bug-coreutils <at> gnu.org: bug#32099; Package coreutils. (Thu, 12 Jul 2018 06:45:02 GMT)
Message #23 received at 32099 <at> debbugs.gnu.org (full text, mbox):
On 07/10/2018 08:21 PM, Assaf Gordon wrote:
> I would suggest several improvements to the measurements, to ensure
> relevant information is conveyed:
>
> 1.
> [...]
Great summary. At the risk of mentioning already-said aspects:
7. Consider all the existing options, i.e. processing modes, of 'uniq' as well.
This means it has to be considered (upfront!) whether introducing an alternate
way to do things - in this case hashing - only serves the trivial case,
or whether it would slow down or even contradict the processing with the other
options, currently:
-c, --count prefix lines by the number of occurrences
-d, --repeated only print duplicate lines, one for each group
-D print all duplicate lines
--all-repeated[=METHOD] like -D, but allow separating groups
with an empty line;
METHOD={none(default),prepend,separate}
-f, --skip-fields=N avoid comparing the first N fields
--group[=METHOD] show all items, separating groups with an empty line;
METHOD={separate(default),prepend,append,both}
-i, --ignore-case ignore differences in case when comparing
-s, --skip-chars=N avoid comparing the first N characters
-u, --unique only print unique lines
-z, --zero-terminated line delimiter is NUL, not newline
-w, --check-chars=N compare no more than N characters in lines
Without thinking about it more deeply, the combinations with
the --group, --repeated and --unique options especially might be problematic.
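For example, even the --count mode changes character under hashing: a
hash-based counterpart of "sort | uniq -c" (a sketch in awk, not a
proposal for uniq itself) has to keep every distinct line plus its count
in memory until end of input, and the order of its output lines is not
guaranteed:
$ printf '%s\n' b a b | awk '{count[$0]++} END{for (line in count) print count[line], line}'
2 b
1 a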
8. Standards. POSIX says:
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/uniq.html
The uniq utility shall read an input file comparing adjacent lines,
and write one copy of each input line on the output. The second and
succeeding copies of repeated adjacent input lines shall not be written.
This makes assumptions about both the input and the output.
Regarding the input, this means that the processing just has to remember
the previous line in order to decide whether to print/discard it in the
output. Therefore, arbitrary input size is fine.
Hashing most probably will have issues with arbitrary input size;
I do not talk about 1GB files, but _much_ larger files (yes, we
are in 2018 where 1GB is like nothing).
Regarding the output, this means that the output is implicitly in
sort order as well. As with the input, uniq can discard the already
written data from memory because it is certain that it doesn't need
to consider it anymore.
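As an illustration of that streaming model (a rough sketch of the
adjacent-line comparison, not uniq's actual code), only the previous
line has to be kept in memory:
$ printf '%s\n' a a b b a | awk 'NR==1 || $0!=prev {print} {prev=$0}'
a
b
a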
That said, a successful optimization idea does not only have to show
that it is faster or needs fewer resources in _some_ cases, but must also
prove that it does not tear down other cases, including extreme
ones. The current implementation might be as-is for a good reason.
If it turns out that the optimization idea screws up a single one
of the above use cases, then the dilemma is whether keeping 2
implementations is warranted (maintenance!), and whether it is possible
to detect the extreme cases early enough to switch from the default
to the other processing strategy.
https://en.wikipedia.org/wiki/Perfect_is_the_enemy_of_good
or D. Knuth:
"premature optimization is the root of all evil
(or at least most of it) in programming"
As Assaf said, a patch with a proof of concept would be helpful ... even
if it only helps to prove that the current implementation is fine.
Have a nice day,
Berny
Information forwarded to bug-coreutils <at> gnu.org: bug#32099; Package coreutils. (Sat, 14 Jul 2018 03:04:02 GMT)
Message #26 received at 32099 <at> debbugs.gnu.org (full text, mbox):
Hey Berny,
I like your suggestion of testing whether hashing
interferes with any other option.
I was glad to see the POSIX standard doesn't
explicitly require sorted input or output.
If someone writes the patch, I should be able to
test it, at least on up to 1 GB of input.
So,
Kingsley
On 07/12/2018 08:43, Bernhard Voelker wrote:
> [...]
--
Time is the fire in which we all burn.
Added tag(s) notabug. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Tue, 30 Oct 2018 03:32:02 GMT)
Severity set to 'wishlist' from 'normal'. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Tue, 30 Oct 2018 03:32:02 GMT)
Bug closed, send any further explanations to 32099 <at> debbugs.gnu.org and "Kingsley G. Morse Jr." <kingsley <at> loaner.com>. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Tue, 30 Oct 2018 03:32:02 GMT)
Forcibly Merged 32099 32101. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Tue, 30 Oct 2018 03:33:01 GMT)
Severity set to 'wishlist' from 'normal'. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Tue, 30 Oct 2018 09:10:01 GMT)
Changed bug title to 'uniq: add hash-based implementation' from 'New uniq option for speed'. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Tue, 30 Oct 2018 09:10:02 GMT)
This bug report was last modified 6 years and 312 days ago.