GNU bug report logs -
#16966
[PATCH] grep: optimization with the superset of DFA
Previous Next
Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Date: Sat, 8 Mar 2014 05:43:01 UTC
Severity: normal
Tags: patch
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
Full log
View this message in rfc822 format
[Message part 1 (text/plain, inline)]
Your message dated Sun, 06 Apr 2014 22:08:53 -0700
with message-id <534232E5.7090408 <at> cs.ucla.edu>
and subject line Re: bug#16966: [PATCH] grep: optimization with the superset of DFA
has caused the debbugs.gnu.org bug report #16966,
regarding [PATCH] grep: optimization with the superset of DFA
to be marked as done.
(If you believe you have received this mail in error, please contact
help-debbugs <at> gnu.org.)
--
16966: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=16966
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
[Message part 3 (text/plain, inline)]
Package: grep
Tags: patch
DFA may be build the superset of itself, which is the same as the itself
expect ANYCHAR, MBCSET and BACKREF are replaced CSET set full bits
followed by STAR, and mb_cur_max is equal to 1, by the patch.
For example, if given the pattern `a\(b\)c\1', the tokens of original
DFA and the superset is below.
original: a b CAT c CAT BACKREF CAT
superset: a b CAT c CAT CSET STAR CAT
(Full bits are set to CSET.)
If a string doesn't matches the superset of DFA for a pattern, the
string will also never match original DFA. By the way, matching with
the superset is very fast because it never has ANYCHAR, MBCSET and
BACKREF, which are very expensive, and its mb_cur_max is always equal
to 1. Therefore, perfomance for matching with DFA may be able to be
dramatically improved without overhead with the superset.
I prepare following string to measure the performance.
yes jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj | head -10000000 > k
I run three tests with this patch (best-of-5 trials):
env LC_ALL=ja_JP.eucJP time -p src/grep -i foobar k
real 1.77 user 1.23 sys 0.47
env LC_ALL=en_US.UTF-8 time -p src/grep -i 'j[a-c]d' k
real 1.86 user 1.35 sys 0.45
env LC_ALL=C time -p src/grep '\(j\)\1d' k
real 1.92 user 1.40 sys 0.48
Back out that commit (temporarily), recompile and rerun the experiment:
env LC_ALL=ja_JP.eucJP time -p src/grep -i foobar k
real 27.21 user 21.15 sys 5.36
env LC_ALL=en_US.UTF-8 time -p src/grep -i 'j[a-c]d' k
real 96.35 user 429.80 sys 57.37
78.57user 15.16system 1:36.35elapsed 97%CPU (0avgtext+0avgdata 3296maxresident)k
env LC_ALL=C time -p src/grep '\(j\)\1d' k
real 502.32 user 429.80 sys 57.37
Norihiro
[patch.txt (application/octet-stream, attachment)]
[Message part 5 (message/rfc822, inline)]
[Message part 6 (text/plain, inline)]
Norihiro Tanaka wrote:
> I rebased this patch, and added four fixes to it.
Thanks. This patch seems like a real performance win for grep -i in the
usual case, so I installed it into the master. If we run into
performance problems in unusual cases I suppose we can look into them as
they arise.
In reviewing the patch I found one memory-access typo:
> - if ((end = memchr(beg, eol, buflim - beg)) != NULL)
> ...
> + if ((end = memchr(next_beg, eol, buflim - beg)) != NULL)
That last "beg" should be "next_beg".
I merged the patch into the current master on Savannah (first attached
file) and then applied a fixup patch (second attached file) that fixes
this bug and makes some other minor related improvements, e.g., using
memrchr rather than looking for eol by hand.
[0001-grep-optimization-with-the-superset-of-DFA.patch (text/plain, attachment)]
[0002-grep-cleanup-DFA-superset-optimization.patch (text/plain, attachment)]
This bug report was last modified 11 years and 107 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.