GNU bug report logs -
#16966
[PATCH] grep: optimization with the superset of DFA
Previous Next
Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Date: Sat, 8 Mar 2014 05:43:01 UTC
Severity: normal
Tags: patch
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
Full log
Message #57 received at 16966 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Norihiro Tanaka wrote:
> s0: The position set is none.
> s1: The position set is 1:a
> s2: The position set is 1:a 2:CSET
> s3: The position set is 2:CSET 3:b (accepted)
> s4: The position set is 2:CSET
Sorry, it was wrong yet. It should be as follows.
s0: The position set is none.
s1: The position set is 1:a 2:CSET 3:b
s2: The position set is 1:a 2:CSET 3:b (accepted)
First of all, I noticed that the example is wrong, because if fixed
strings are included in the pattern, kwset is used for the pattern,
and dfahint() is called for each single line.
--
By the way, I compared two cases, which are CSET* and
CSET{1,mb_cur_max}. I used `\(a\|z\)[x-y]\{10\}\(b\|z\)' as the pattern.
`[x-z]' is MBCSET in UTF-8 locale.
If replace MBCSET to CSET*, the pattern in the superset is
`\(a\|z\)\(CSET*\)\{10\}\(b\|z\)'.
If replace MBCSET to CSET{1,mb_cur_max}, the pattern in the superset is
`\(a\|z\)\(CSET\{1,6\}\)\{10\}\(b\|z\)'.
In order to simplify, use `\(a\|z\)\(.\{1,6\}\)\{10\}\(b\|z\)' in
POSIX locale for later.
Before testing, apply the patch attached on this mail, Set `DDEBUG' to
CPPFLAGS, and re-build. It outputs more debugging information.
$ yes `printf '%0100d\n'` | head -320 | \
sed -e '1s/^0/a/; $s/0$/b/' | \
env LANG=en_US.UTF-8 src/grep '\(a\|z\)[x-y]\{10\}\(b\|z\)' 2>debug1.log
$ yes `printf '%0100d\n'` | head -320 | \
sed -e '1s/^0/a/; $s/0$/b/' | \
env LANG=C src/grep '\(a\|z\)\(.\{1,6\}\)\{10\}\(b\|z\)' 2>debug2.log
(`320' originates in the default size of a buffer.)
The formar is only built 3 states.
OTOH, the later is built 224 dfastates.
i.e. even if matched multi-lines, many states aren't necessarily built.
Next, I checked performance for each case in worst text.
It's best in 10 times.
yes `printf '%0100d\n'` | head -2 | sed -e '1s/^0/a/; $s/0$/b/' >t
yes t | head -1000000 | xargs cat >u
time -p env LANG=C src/grep '\(a\|z\)\(.\{1,6\}\)\{10\}\(b\|z\)' u
real 1.26 user 0.95 sys 0.28
time -p env LANG=en_US.UTF-8 src/grep '\(a\|z\)[x-y]\{10\}\(b\|z\)' u
real 0.74 user 0.45 sys 0.27
Though, the later is fast, I like the former, because it's very simple.
the later is complicated, also needs to update nleaves and depth.
Norihiro
[patch.txt (text/plain, attachment)]
This bug report was last modified 11 years and 107 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.