#16966 - [PATCH] grep: optimization with the superset of DFA

GNU bug report logs - #16966
[PATCH] grep: optimization with the superset of DFA

Package: grep;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Sat, 8 Mar 2014 05:43:01 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

View this message in rfc822 format

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp> To: Paolo Bonzini <bonzini <at> gnu.org>, 16966 <at> debbugs.gnu.org Subject: bug#16966: [PATCH] grep: optimization with the superset of DFA Date: Fri, 04 Apr 2014 12:28:47 +0900

[Message part 1 (text/plain, inline)]

Norihiro Tanaka wrote: > s0: The position set is none. > s1: The position set is 1:a > s2: The position set is 1:a 2:CSET > s3: The position set is 2:CSET 3:b (accepted) > s4: The position set is 2:CSET Sorry, it was wrong yet. It should be as follows. s0: The position set is none. s1: The position set is 1:a 2:CSET 3:b s2: The position set is 1:a 2:CSET 3:b (accepted) First of all, I noticed that the example is wrong, because if fixed strings are included in the pattern, kwset is used for the pattern, and dfahint() is called for each single line. -- By the way, I compared two cases, which are CSET* and CSET{1,mb_cur_max}. I used `$a\|z$[x-y]\{10\}$b\|z$' as the pattern. `[x-z]' is MBCSET in UTF-8 locale. If replace MBCSET to CSET*, the pattern in the superset is `$a\|z$$CSET*$\{10\}$b\|z$'. If replace MBCSET to CSET{1,mb_cur_max}, the pattern in the superset is `$a\|z$$CSET\{1,6\}$\{10\}$b\|z$'. In order to simplify, use `$a\|z$$.\{1,6\}$\{10\}$b\|z$' in POSIX locale for later. Before testing, apply the patch attached on this mail, Set `DDEBUG' to CPPFLAGS, and re-build. It outputs more debugging information. $ yes `printf '%0100d\n'` | head -320 | \ sed -e '1s/^0/a/; $s/0$/b/' | \ env LANG=en_US.UTF-8 src/grep '$a\|z$[x-y]\{10\}$b\|z$' 2>debug1.log $ yes `printf '%0100d\n'` | head -320 | \ sed -e '1s/^0/a/; $s/0$/b/' | \ env LANG=C src/grep '$a\|z$$.\{1,6\}$\{10\}$b\|z$' 2>debug2.log (`320' originates in the default size of a buffer.) The formar is only built 3 states. OTOH, the later is built 224 dfastates. i.e. even if matched multi-lines, many states aren't necessarily built. Next, I checked performance for each case in worst text. It's best in 10 times. yes `printf '%0100d\n'` | head -2 | sed -e '1s/^0/a/; $s/0$/b/' >t yes t | head -1000000 | xargs cat >u time -p env LANG=C src/grep '$a\|z$$.\{1,6\}$\{10\}$b\|z$' u real 1.26 user 0.95 sys 0.28 time -p env LANG=en_US.UTF-8 src/grep '$a\|z$[x-y]\{10\}$b\|z$' u real 0.74 user 0.45 sys 0.27 Though, the later is fast, I like the former, because it's very simple. the later is complicated, also needs to update nleaves and depth. Norihiro

[patch.txt (text/plain, attachment)]

This bug report was last modified 11 years and 108 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #16966 [PATCH] grep: optimization with the superset of DFA

GNU bug report logs - #16966
[PATCH] grep: optimization with the superset of DFA