GNU bug report logs - #16966
[PATCH] grep: optimization with the superset of DFA

Previous Next

Package: grep;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Sat, 8 Mar 2014 05:43:01 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Full log


View this message in rfc822 format

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paolo Bonzini <bonzini <at> gnu.org>
Cc: 16966 <at> debbugs.gnu.org
Subject: bug#16966: [PATCH] grep: optimization with the superset of DFA
Date: Wed, 02 Apr 2014 22:02:01 +0900
Paolo Bonzini wrote:
> Does anything change if there are a few million c's?

The superset of `a ANYCHAR b' is 'a CSET STAR b'.
It's DFA states are following.

s0: The position set is none. 
s1: The position set is 1:a 
s2: The position set is 1:a 2:CSET
s3: The position set is 1:a 2:CSET 3:b (accepted)

        1  accccccccccccccccccccccccccccccccccccccc
           ^ s1

        1  accccccccccccccccccccccccccccccccccccccc
            ^ s2

        1  accccccccccccccccccccccccccccccccccccccc
             ^ s2

              ......

1,000,000  cccccccccccccccccccccccccccccccccccccccb
                                                 ^ s2

1,000,000  cccccccccccccccccccccccccccccccccccccccb
                                                  ^ s3 accepted

Then re-searched with the superset because matched on multi-lines.

1,000,000  cccccccccccccccccccccacccccccccccccccccb
           ^ s0


1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                ^ s1

1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                 ^ s2

1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                                 ^ s2

1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                                  ^ s3 accepted

Then re-searched with the original DFA becuase matched on one-line.
The DFA states are following.

s0: The position set is none. 
s1: The position set is 1:a 
s2: The position set is 1:a 2:ANYCHAR
s3: The position set is 1:a 2:ANYCHAR 3:b (accepted)

1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                ^ s1

1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                 ^ s2

1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                  ^ s0 rejected



Even if matched multi-line, It's still as fast and memory-required as
matched single-line, because in search-mode STAR doesn't count number
of NFA and DFA states up.  So for example, a set of DFA states for
`a CSET STAR b' is correspond with for `a CSET b'. (If my recognition is
right.)

OTOH, If CSET{1,mb_cur_max} is assigned for ANYCHAR, number of DFA
states will be greater than above, and I may be slightly slower
becuase of overheads to build DFA states.  Notice that `{m,n}'
expression requires a lot of memory and is slow (See also
http://www.gnu.org/software/grep/manual/grep.html#Reporting-Bugs).

Example, use the superset for following pattern in EUC-JP locale.
It's mb_cur_max == 3.

    a.....b

If ANYCHAR is replaced to `CSET STAR', as following.

    a CSET STAR CSET STAR CSET STAR CSET STAR CSET STAR CSET STAR b

I draw it below.

        /\
       /  \  2:CSET
       \  /
        \/
       1:a ---> 3:b
(The figure drawn previously was wrong.)

It's equal to following.

    a CSET STAR b

OTOH, the period is replaced to `CSET CSET CSET CAT OR CSET CSET CAT
CSET CAT OR', below.  Itt's complicated, and I don't want to draw it. :)

    a CSET CSET CSET CAT OR CSET CSET CAT CSET CAT OR CSET CSET CSET
    CAT OR CSET CSET CAT CSET CAT OR CAT CSET CSET CSET CAT OR CSET
    CSET CAT CSET CAT OR CAT CSET CSET CSET CAT OR CSET CSET CAT CSET
    CAT OR CAT b

Thanks,
Norihiro





This bug report was last modified 11 years and 107 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.