GNU bug report logs -
#19358
[PATCH 1/3] grep: use Aho-Corasick algorithm to search multiple fixed words
Previous Next
Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Date: Fri, 12 Dec 2014 15:59:01 UTC
Severity: normal
Tags: patch
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 19358 in the body.
You can then email your comments to 19358 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-grep <at> gnu.org
:
bug#19358
; Package
grep
.
(Fri, 12 Dec 2014 15:59:01 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
Norihiro Tanaka <noritnk <at> kcn.ne.jp>
:
New bug report received and forwarded. Copy sent to
bug-grep <at> gnu.org
.
(Fri, 12 Dec 2014 15:59:02 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Hi,
Searching multiple fixed words, grep uses Commentz-Walter algorithm, but
it is very slow for a worst case e.g. as following. It is O(m*n).
- input: yes `printf %040d` | head -10000000
- word1: x0000000000000000000
- word2: x
This change uses Aho-Corasick algorithm instead of Commentz-Walter
algorithm to search multiple fixed words. It uses a function to build
tries which has been already defined for Commentz-Walter algorithm in
kwset.c and which has been already high quality.
I see 7x speed-up even for a typical case on Fedora 21 with a 3.2GHz i5
by this change.
First without this change (best-of-5 trials):
find /usr/share/doc/ -type f |
LC_ALL=C time -p xargs.sh src/grep -Ff /usr/share/dict/linux.words >/dev/null
real 11.37 user 11.03 sys 0.24
Next with this change (best-of-5 trials):
find /usr/share/doc/ -type f |
LC_ALL=C time -p xargs.sh src/grep -Ff /usr/share/dict/linux.words >/dev/null
real 1.49 user 1.31 sys 0.15
I also wrote two additional patches.
Second, If search multiple fixed words, grep immediately returns without
longest match if not needed. Without this change, grep tries longest
match for multiple words even if not needed.
Third, use memchr2 for two patterns of a character.
Thanks,
Norihiro
[0001-grep-use-Aho-Corasick-algorithm-to-search-multiple-f.patch (text/plain, attachment)]
[0002-grep-immediately-return-without-longest-match-to-sea.patch (text/plain, attachment)]
[0003-use-memchr2-for-two-patterns-of-a-character.patch (text/plain, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#19358
; Package
grep
.
(Thu, 06 Aug 2015 04:13:02 GMT)
Full text and
rfc822 format available.
Message #8 received at 19358 <at> debbugs.gnu.org (full text, mbox):
On Fri, Dec 12, 2014 at 7:58 AM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> Hi,
>
> Searching multiple fixed words, grep uses Commentz-Walter algorithm, but
> it is very slow for a worst case e.g. as following. It is O(m*n).
>
> - input: yes `printf %040d` | head -10000000
> - word1: x0000000000000000000
> - word2: x
>
> This change uses Aho-Corasick algorithm instead of Commentz-Walter
> algorithm to search multiple fixed words. It uses a function to build
> tries which has been already defined for Commentz-Walter algorithm in
> kwset.c and which has been already high quality.
>
> I see 7x speed-up even for a typical case on Fedora 21 with a 3.2GHz i5
> by this change.
>
> First without this change (best-of-5 trials):
>
> find /usr/share/doc/ -type f |
> LC_ALL=C time -p xargs.sh src/grep -Ff /usr/share/dict/linux.words >/dev/null
> real 11.37 user 11.03 sys 0.24
>
> Next with this change (best-of-5 trials):
>
> find /usr/share/doc/ -type f |
> LC_ALL=C time -p xargs.sh src/grep -Ff /usr/share/dict/linux.words >/dev/null
> real 1.49 user 1.31 sys 0.15
Nice patch. I see a similar performance improvement.
> I also wrote two additional patches.
>
> Second, If search multiple fixed words, grep immediately returns without
> longest match if not needed. Without this change, grep tries longest
> match for multiple words even if not needed.
That sounds like a very good idea and the change looks correct.
> Third, use memchr2 for two patterns of a character.
This sounds useful, but I discovered that the new code is not
triggered even once by any of our tests. As such, I suspect that
it is not justified to add these new conditionals in code that is
often inlined. Do you have experiments that demonstrate how
that final patch helps?
Thank you,
Jim
Information forwarded
to
bug-grep <at> gnu.org
:
bug#19358
; Package
grep
.
(Thu, 06 Aug 2015 13:11:02 GMT)
Full text and
rfc822 format available.
Message #11 received at 19358 <at> debbugs.gnu.org (full text, mbox):
On Wed, 5 Aug 2015 21:12:33 -0700
Jim Meyering <jim <at> meyering.net> wrote:
> This sounds useful, but I discovered that the new code is not
> triggered even once by any of our tests. As such, I suspect that
> it is not justified to add these new conditionals in code that is
> often inlined. Do you have experiments that demonstrate how
> that final patch helps?
Thanks for reviewing. It is effective in only special case. Could you
try following test for not patched and patched?
$ printf 'a\nb\n' >pat
$ yes $(printf %040d 0) | head -10000000 >in
$ time -p src/grep -Ff pat in
In my machine, patched version is about 10% faster than not patched.
Reply sent
to
Paul Eggert <eggert <at> cs.ucla.edu>
:
You have taken responsibility.
(Thu, 02 Jun 2016 22:46:01 GMT)
Full text and
rfc822 format available.
Notification sent
to
Norihiro Tanaka <noritnk <at> kcn.ne.jp>
:
bug acknowledged by developer.
(Thu, 02 Jun 2016 22:46:01 GMT)
Full text and
rfc822 format available.
Message #16 received at 19358-done <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Sorry that patch took so long to review. I installed it, along with the
attached followup patches which are mostly just minor style things (plus
fixing the attribution for a patch that I forgot to specify --author for).
I didn't get as much performance improvement on my platform, so I toned
down the NEWS item a bit. Still, wow. It is a 2.5x performance
improvement for that test case, and it's asymptotically better. Thanks.
[0001-grep-minor-cleanups-for-F-Aho-Corasick.patch (application/x-patch, attachment)]
[0002-grep-simplify-F-Aho-Corasick-a-bit.patch (application/x-patch, attachment)]
[0003-maint-correct-attribution.patch (application/x-patch, attachment)]
bug archived.
Request was from
Debbugs Internal Request <help-debbugs <at> gnu.org>
to
internal_control <at> debbugs.gnu.org
.
(Fri, 01 Jul 2016 11:24:05 GMT)
Full text and
rfc822 format available.
This bug report was last modified 8 years and 351 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.