GNU bug report logs - #16966
[PATCH] grep: optimization with the superset of DFA

Previous Next

Package: grep;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Sat, 8 Mar 2014 05:43:01 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Full log


View this message in rfc822 format

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paolo Bonzini <bonzini <at> gnu.org>, 16966 <at> debbugs.gnu.org
Subject: bug#16966: [PATCH] grep: optimization with the superset of DFA
Date: Fri, 04 Apr 2014 12:28:47 +0900
[Message part 1 (text/plain, inline)]
Norihiro Tanaka wrote:
> s0: The position set is none.
> s1: The position set is 1:a
> s2: The position set is 1:a 2:CSET
> s3: The position set is 2:CSET 3:b (accepted)
> s4: The position set is 2:CSET

Sorry, it was wrong yet.  It should be as follows.

s0: The position set is none.
s1: The position set is 1:a 2:CSET 3:b
s2: The position set is 1:a 2:CSET 3:b (accepted)

First of all, I noticed that the example is wrong, because if fixed
strings are included in the pattern, kwset is used for the pattern,
and dfahint() is called for each single line.

--
By the way, I compared two cases, which are CSET* and
CSET{1,mb_cur_max}.  I used `\(a\|z\)[x-y]\{10\}\(b\|z\)' as the pattern.
`[x-z]' is MBCSET in UTF-8 locale.

If replace MBCSET to CSET*, the pattern in the superset is
`\(a\|z\)\(CSET*\)\{10\}\(b\|z\)'.
If replace MBCSET to CSET{1,mb_cur_max}, the pattern in the superset is
`\(a\|z\)\(CSET\{1,6\}\)\{10\}\(b\|z\)'.

In order to simplify, use `\(a\|z\)\(.\{1,6\}\)\{10\}\(b\|z\)' in
POSIX locale for later.

Before testing, apply the patch attached on this mail, Set `DDEBUG' to
CPPFLAGS, and re-build.  It outputs more debugging information.

  $ yes `printf '%0100d\n'` | head -320 | \
      sed -e '1s/^0/a/; $s/0$/b/' | \
      env LANG=en_US.UTF-8 src/grep '\(a\|z\)[x-y]\{10\}\(b\|z\)' 2>debug1.log
  $ yes `printf '%0100d\n'` | head -320 | \
      sed -e '1s/^0/a/; $s/0$/b/' | \
      env LANG=C src/grep '\(a\|z\)\(.\{1,6\}\)\{10\}\(b\|z\)' 2>debug2.log

(`320' originates in the default size of a buffer.)

The formar is only built 3 states.
OTOH, the later is built 224 dfastates.
i.e. even if matched multi-lines, many states aren't necessarily built. 

Next, I checked performance for each case in worst text.
It's best in 10 times.

yes `printf '%0100d\n'` | head -2 | sed -e '1s/^0/a/; $s/0$/b/' >t
yes t | head -1000000 | xargs cat >u

time -p env LANG=C src/grep '\(a\|z\)\(.\{1,6\}\)\{10\}\(b\|z\)' u
    real 1.26       user 0.95       sys 0.28
time -p env LANG=en_US.UTF-8 src/grep '\(a\|z\)[x-y]\{10\}\(b\|z\)' u
    real 0.74       user 0.45       sys 0.27

Though, the later is fast, I like the former, because it's very simple.
the later is complicated, also needs to update nleaves and depth.

Norihiro
[patch.txt (text/plain, attachment)]

This bug report was last modified 11 years and 108 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.