GNU bug report logs - #19358
[PATCH 1/3] grep: use Aho-Corasick algorithm to search multiple fixed words

Previous Next

Package: grep;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Fri, 12 Dec 2014 15:59:01 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Full log


View this message in rfc822 format

From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Subject: bug#19358: closed (Re: grep: use Aho-Corasick algorithm to search
 multiple fixed words)
Date: Thu, 02 Jun 2016 22:46:01 +0000
[Message part 1 (text/plain, inline)]
Your bug report

#19358: [PATCH 1/3] grep: use Aho-Corasick algorithm to search multiple fixed words

which was filed against the grep package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 19358 <at> debbugs.gnu.org.

-- 
19358: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=19358
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 19358-done <at> debbugs.gnu.org
Subject: Re: grep: use Aho-Corasick algorithm to search multiple fixed words
Date: Thu, 2 Jun 2016 15:45:27 -0700
[Message part 3 (text/plain, inline)]
Sorry that patch took so long to review. I installed it, along with the 
attached followup patches which are mostly just minor style things (plus 
fixing the attribution for a patch that I forgot to specify --author for).

I didn't get as much performance improvement on my platform, so I toned 
down the NEWS item a bit. Still, wow. It is a 2.5x performance 
improvement for that test case, and it's asymptotically better. Thanks.

[0001-grep-minor-cleanups-for-F-Aho-Corasick.patch (application/x-patch, attachment)]
[0002-grep-simplify-F-Aho-Corasick-a-bit.patch (application/x-patch, attachment)]
[0003-maint-correct-attribution.patch (application/x-patch, attachment)]
[Message part 7 (message/rfc822, inline)]
From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: <bug-grep <at> gnu.org>
Subject: [PATCH 1/3] grep: use Aho-Corasick algorithm to search multiple fixed
 words
Date: Sat, 13 Dec 2014 00:58:02 +0900
[Message part 8 (text/plain, inline)]
Hi,

Searching multiple fixed words, grep uses Commentz-Walter algorithm, but
it is very slow for a worst case e.g. as following.  It is O(m*n).

  - input: yes `printf %040d` | head -10000000
  - word1: x0000000000000000000
  - word2: x

This change uses Aho-Corasick algorithm instead of Commentz-Walter
algorithm to search multiple fixed words.  It uses a function to build
tries which has been already defined for Commentz-Walter algorithm in
kwset.c and which has been already high quality.

I see 7x speed-up even for a typical case on Fedora 21 with a 3.2GHz i5
by this change.

First without this change (best-of-5 trials):

    find /usr/share/doc/ -type f |
    LC_ALL=C time -p xargs.sh src/grep -Ff /usr/share/dict/linux.words >/dev/null
        real 11.37      user 11.03      sys 0.24

Next with this change (best-of-5 trials):

    find /usr/share/doc/ -type f |
    LC_ALL=C  time -p xargs.sh src/grep -Ff /usr/share/dict/linux.words >/dev/null
        real 1.49       user 1.31       sys 0.15

I also wrote two additional patches.

Second, If search multiple fixed words, grep immediately returns without
longest match if not needed.  Without this change, grep tries longest
match for multiple words even if not needed.

Third, use memchr2 for two patterns of a character.

Thanks,
Norihiro
[0001-grep-use-Aho-Corasick-algorithm-to-search-multiple-f.patch (text/plain, attachment)]
[0002-grep-immediately-return-without-longest-match-to-sea.patch (text/plain, attachment)]
[0003-use-memchr2-for-two-patterns-of-a-character.patch (text/plain, attachment)]

This bug report was last modified 8 years and 352 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.