GNU bug report logs - #16966
[PATCH] grep: optimization with the superset of DFA

Previous Next

Package: grep;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Sat, 8 Mar 2014 05:43:01 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 16966 in the body.
You can then email your comments to 16966 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-grep <at> gnu.org:
bug#16966; Package grep. (Sat, 08 Mar 2014 05:43:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sat, 08 Mar 2014 05:43:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: submit <at> debbugs.gnu.org
Subject: [PATCH] grep: optimization with the superset of DFA
Date: Sat, 08 Mar 2014 14:41:42 +0900
[Message part 1 (text/plain, inline)]
Package: grep
Tags: patch

DFA may be build the superset of itself, which is the same as the itself
expect ANYCHAR, MBCSET and BACKREF are replaced CSET set full bits
followed by STAR, and mb_cur_max is equal to 1, by the patch.

For example, if given the pattern `a\(b\)c\1', the tokens of original
DFA and the superset is below.

  original: a b CAT c CAT BACKREF CAT

  superset: a b CAT c CAT CSET STAR CAT
            (Full bits are set to CSET.)

If a string doesn't matches the superset of DFA for a pattern, the
string will also never match original DFA.  By the way, matching with
the superset is very fast because it never has ANYCHAR, MBCSET and
BACKREF, which are very expensive, and its mb_cur_max is always equal
to 1.  Therefore, perfomance for matching with DFA may be able to be
dramatically improved without overhead with the superset.

I prepare following string to measure the performance.

    yes jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj | head -10000000 > k

I run three tests with this patch (best-of-5 trials):

    env LC_ALL=ja_JP.eucJP time -p src/grep -i foobar k
        real 1.77       user 1.23       sys 0.47
    env LC_ALL=en_US.UTF-8 time -p src/grep -i 'j[a-c]d' k
        real 1.86       user 1.35       sys 0.45
    env LC_ALL=C time -p src/grep '\(j\)\1d' k
        real 1.92       user 1.40       sys 0.48

Back out that commit (temporarily), recompile and rerun the experiment:

    env LC_ALL=ja_JP.eucJP time -p src/grep -i foobar k
        real 27.21      user 21.15      sys 5.36
    env LC_ALL=en_US.UTF-8 time -p src/grep -i 'j[a-c]d' k
        real 96.35      user 429.80     sys 57.37
        78.57user 15.16system 1:36.35elapsed 97%CPU (0avgtext+0avgdata 3296maxresident)k
    env LC_ALL=C time -p src/grep '\(j\)\1d' k
        real 502.32     user 429.80     sys 57.37

Norihiro
[patch.txt (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16966; Package grep. (Sat, 08 Mar 2014 07:10:01 GMT) Full text and rfc822 format available.

Message #8 received at 16966 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: 16966 <at> debbugs.gnu.org
Subject: Re: bug#16966: [PATCH] grep: optimization with the superset of DFA
Date: Sat, 08 Mar 2014 16:09:34 +0900
[Message part 1 (text/plain, inline)]
I fixed the bug which doesn't QMARK and PLUS in dfasuperset() and
modified serveral comments.
[patch.txt (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16966; Package grep. (Sun, 09 Mar 2014 23:52:14 GMT) Full text and rfc822 format available.

Acknowledgement sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
Extra info received and forwarded to list. Copy sent to bug-grep <at> gnu.org. (Sun, 09 Mar 2014 23:52:17 GMT) Full text and rfc822 format available.

Message #13 received at 16966 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: 16966 <at> debbugs.gnu.org
Subject: bug#16966: [PATCH] grep: optimization with the superset of DFA
Date: Sun, 09 Mar 2014 21:50:54 +0900
[Message part 1 (text/plain, inline)]
Sorry, the patch still had bugs.  I fixed them.  I confirmed that the
patched version passed all regression tests.
[patch.txt (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16966; Package grep. (Sun, 09 Mar 2014 23:52:18 GMT) Full text and rfc822 format available.

Acknowledgement sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
Extra info received and forwarded to list. Copy sent to bug-grep <at> gnu.org. (Sun, 09 Mar 2014 23:52:19 GMT) Full text and rfc822 format available.

Message #18 received at 16966 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: 16966 <at> debbugs.gnu.org
Subject: bug#16966: [PATCH] grep: optimization with the superset of DFA
Date: Mon, 10 Mar 2014 00:33:06 +0900
[Message part 1 (text/plain, inline)]
Sorry, the patch still had bugs.  I fixed them.  I confirmed that the
patched version passed all regression tests.
[patch.txt (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16966; Package grep. (Fri, 28 Mar 2014 17:21:02 GMT) Full text and rfc822 format available.

Message #21 received at 16966 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: 16966 <at> debbugs.gnu.org
Subject: bug#16966: [PATCH] grep: optimization with the superset of DFA
Date: Sat, 29 Mar 2014 02:20:28 +0900
[Message part 1 (text/plain, inline)]
I rebased this patch, and added four fixes to it.

 1. Fix for the conditions that the superset is used.  No longer use it
    when don't include any normal chars and CSETs. (dfasuperset)

 2. Ignore any letter constrations.  Otherwise, it mayn't be able to be
    a superset of the original dfa. (dfasuperset)

 3. Change return type of dfahint().  It can check whether used or not
    from caller.(dfahint)

 4. If both kwset and dfahint() aren't used, run DFA matcher in whole
    range still.

Norihiro
[patch.txt (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16966; Package grep. (Tue, 01 Apr 2014 08:17:02 GMT) Full text and rfc822 format available.

Message #24 received at 16966 <at> debbugs.gnu.org (full text, mbox):

From: Paolo Bonzini <bonzini <at> gnu.org>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, 16966 <at> debbugs.gnu.org
Subject: Re: bug#16966: [PATCH] grep: optimization with the superset of DFA
Date: Tue, 01 Apr 2014 10:16:32 +0200
Il 28/03/2014 18:20, Norihiro Tanaka ha scritto:
> I rebased this patch, and added four fixes to it.
>
>  1. Fix for the conditions that the superset is used.  No longer use it
>     when don't include any normal chars and CSETs. (dfasuperset)
>
>  2. Ignore any letter constrations.  Otherwise, it mayn't be able to be
>     a superset of the original dfa. (dfasuperset)
>
>  3. Change return type of dfahint().  It can check whether used or not
>     from caller.(dfahint)
>
>  4. If both kwset and dfahint() aren't used, run DFA matcher in whole
>     range still.

For ANYCHAR, you can convert it to CSET{1,mb_cur_max} or, even better, 
(single-CSET | lead-CSET full-CSET{0,mb_cur_max-1}).

Single-CSET and lead-CSET can be computed by looping over the 256 
characters with mbrtowc and looking respectively for non-negative or -2 
return values.

Paolo




Information forwarded to bug-grep <at> gnu.org:
bug#16966; Package grep. (Tue, 01 Apr 2014 15:20:03 GMT) Full text and rfc822 format available.

Message #27 received at 16966 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paolo Bonzini <bonzini <at> gnu.org>
Cc: 16966 <at> debbugs.gnu.org
Subject: bug#16966: [PATCH] grep: optimization with the superset of DFA
Date: Wed, 02 Apr 2014 00:18:52 +0900
Hi Paulo,

> For ANYCHAR, you can convert it to CSET{1,mb_cur_max} or, even better, (single-CSET | lead-CSET full-CSET{0,mb_cur_max-1}).

I seem that it's complicated.  The superset requires a memory area that
is different from the original DFA and additional costs to build it.  And
exact matching isn't required for it.  So, I want to make it simple and
smaller DFA.

Do you know how to code ANYCHAR correctly in a simple method? 

Norihiro





Information forwarded to bug-grep <at> gnu.org:
bug#16966; Package grep. (Tue, 01 Apr 2014 15:33:02 GMT) Full text and rfc822 format available.

Message #30 received at 16966 <at> debbugs.gnu.org (full text, mbox):

From: Paolo Bonzini <bonzini <at> gnu.org>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 16966 <at> debbugs.gnu.org
Subject: Re: bug#16966: [PATCH] grep: optimization with the superset of DFA
Date: Tue, 01 Apr 2014 17:32:16 +0200
Il 01/04/2014 17:18, Norihiro Tanaka ha scritto:
>> > For ANYCHAR, you can convert it to CSET{1,mb_cur_max} or, even better, (single-CSET | lead-CSET full-CSET{0,mb_cur_max-1}).
> I seem that it's complicated.  The superset requires a memory area that
> is different from the original DFA and additional costs to build it.  And
> exact matching isn't required for it.  So, I want to make it simple and
> smaller DFA.

I'm worried that the "STAR" method will match basically everything. 
We're using something like CSET{1,mb_cur_max} already for UTF-8, so the 
size increase for that should not be too bad.

Paolo




Information forwarded to bug-grep <at> gnu.org:
bug#16966; Package grep. (Tue, 01 Apr 2014 15:50:02 GMT) Full text and rfc822 format available.

Message #33 received at 16966 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paolo Bonzini <bonzini <at> gnu.org>
Cc: 16966 <at> debbugs.gnu.org
Subject: bug#16966: [PATCH] grep: optimization with the superset of DFA
Date: Wed, 02 Apr 2014 00:49:19 +0900
Hi Paolo,

> I'm worried that the "STAR" method will match basically everything.

If no normal char and/or CSET is included in the pattern, the superset
won't be used.

> We're using something like CSET{1,mb_cur_max} already for UTF-8, so the size increase for that should not be too bad.

We cannot apply its rule to non-UTF8 locales.

Norihiro





Information forwarded to bug-grep <at> gnu.org:
bug#16966; Package grep. (Tue, 01 Apr 2014 17:49:01 GMT) Full text and rfc822 format available.

Message #36 received at 16966 <at> debbugs.gnu.org (full text, mbox):

From: Paolo Bonzini <bonzini <at> gnu.org>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 16966 <at> debbugs.gnu.org
Subject: Re: bug#16966: [PATCH] grep: optimization with the superset of DFA
Date: Tue, 01 Apr 2014 19:48:05 +0200
Il 01/04/2014 17:49, Norihiro Tanaka ha scritto:
>
>> > I'm worried that the "STAR" method will match basically everything.
> If no normal char and/or CSET is included in the pattern, the superset
> won't be used.
>

Yeah, but my problem is that a.b will look at a very long line if it is 
translated to a[\x0-\xff]*b.  Better translate it to a[\x0-\xff]{1,2}b 
or something similar.

Paolo




Information forwarded to bug-grep <at> gnu.org:
bug#16966; Package grep. (Tue, 01 Apr 2014 22:22:01 GMT) Full text and rfc822 format available.

Message #39 received at 16966 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paolo Bonzini <bonzini <at> gnu.org>
Cc: 16966 <at> debbugs.gnu.org
Subject: bug#16966: [PATCH] grep: optimization with the superset of DFA
Date: Wed, 02 Apr 2014 07:21:02 +0900
Paolo Bonzini wrote:
> Yeah, but my problem is that a.b will look at a very long line if it
> is translated to a[\x0-\xff]*b.  Better translate it to a[\x0-\xff]{1,2}b
> or something similar.

I seem that it's no problem.

For example, I try following text for the pattern `a.b'.  Whereas the
digit isn't a part of the text but the line number.

1 accccccccccccccccccccccccccccccccccccccc
2 cccccccccccccccccccccccccccccccccccccccc
3 cccccccccccccccccccccccccccccccccccccccc
4 cccccccccccccccccccccccccccccccccccccccc
5 cccccccccccccccccccccccccccccccccccccccb

a --> a + CSET --> a + CSET --> ...... --> a + CSET + b (match)

Then all lines are matched fast.  However, because matches multiple
lines, retry from the last line (line 5).

It doesn't matches the pattern `a.b'.  Therefore the text is rejected.

Next, I try following text for it.

1 accccccccccccccccccccccccccccccccccccccc
2 cccccccccccccccccccccccccccccccccccccccc
3 cccccccccccccccccccccccccccccccccccccccc
4 cccccccccccccccccccccccccccccccccccccccc
5 accccccccccccccccccccccccccccccccccccccb

Then all lines are matched fast.  However, because matches multiple
lines, retry from the last line (line 5).  It's accepted by the
superset.  However, It's rejected by normal DFA.

On the other hands, It can be constituted just three DFA states.  It's
too simple.

             /\
            /  \
            \  /
             \/
  1:a ---> 2:CSET ---> 3:b

Thanks,
Norihiro





Information forwarded to bug-grep <at> gnu.org:
bug#16966; Package grep. (Tue, 01 Apr 2014 23:00:03 GMT) Full text and rfc822 format available.

Message #42 received at 16966 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paolo Bonzini <bonzini <at> gnu.org>
Cc: 16966 <at> debbugs.gnu.org
Subject: bug#16966: [PATCH] grep: optimization with the superset of DFA
Date: Wed, 02 Apr 2014 07:58:44 +0900
Norihiro Tanaka wrote:
> For example, I try following text for the pattern `a.b'. 

In UTF8, the pattern `a.b' doesn't use the superset.  Consider `a[d-z]b'
and/or `\(a\)\1b' instead of it.

Norihiro





Information forwarded to bug-grep <at> gnu.org:
bug#16966; Package grep. (Tue, 01 Apr 2014 23:57:01 GMT) Full text and rfc822 format available.

Message #45 received at 16966 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paolo Bonzini <bonzini <at> gnu.org>
Cc: 16966 <at> debbugs.gnu.org
Subject: bug#16966: [PATCH] grep: optimization with the superset of DFA
Date: Wed, 02 Apr 2014 08:55:53 +0900
Paolo Bonzini <bonzini <at> gnu.org> wrote:
> Better translate it to a[\x0-\xff]{1,2}b or something similar. 

I also thought that previously.  However, since we don't ask an exact
match for the superset, that is believed to be meaningless.

Norihiro





Information forwarded to bug-grep <at> gnu.org:
bug#16966; Package grep. (Wed, 02 Apr 2014 07:10:02 GMT) Full text and rfc822 format available.

Message #48 received at 16966 <at> debbugs.gnu.org (full text, mbox):

From: Paolo Bonzini <bonzini <at> gnu.org>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 16966 <at> debbugs.gnu.org
Subject: Re: bug#16966: [PATCH] grep: optimization with the superset of DFA
Date: Wed, 02 Apr 2014 09:09:05 +0200
Il 02/04/2014 00:21, Norihiro Tanaka ha scritto:
> 1 accccccccccccccccccccccccccccccccccccccc
> 2 cccccccccccccccccccccccccccccccccccccccc
> 3 cccccccccccccccccccccccccccccccccccccccc
> 4 cccccccccccccccccccccccccccccccccccccccc
> 5 accccccccccccccccccccccccccccccccccccccb
>
> Then all lines are matched fast.  However, because matches multiple
> lines, retry from the last line (line 5).  It's accepted by the
> superset.  However, It's rejected by normal DFA.
>
> On the other hands, It can be constituted just three DFA states.  It's
> too simple.
>
>              /\
>             /  \
>             \  /
>              \/
>   1:a ---> 2:CSET ---> 3:b

Does anything change if there are a few million c's?

Paolo




Information forwarded to bug-grep <at> gnu.org:
bug#16966; Package grep. (Wed, 02 Apr 2014 13:03:02 GMT) Full text and rfc822 format available.

Message #51 received at 16966 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paolo Bonzini <bonzini <at> gnu.org>
Cc: 16966 <at> debbugs.gnu.org
Subject: bug#16966: [PATCH] grep: optimization with the superset of DFA
Date: Wed, 02 Apr 2014 22:02:01 +0900
Paolo Bonzini wrote:
> Does anything change if there are a few million c's?

The superset of `a ANYCHAR b' is 'a CSET STAR b'.
It's DFA states are following.

s0: The position set is none. 
s1: The position set is 1:a 
s2: The position set is 1:a 2:CSET
s3: The position set is 1:a 2:CSET 3:b (accepted)

        1  accccccccccccccccccccccccccccccccccccccc
           ^ s1

        1  accccccccccccccccccccccccccccccccccccccc
            ^ s2

        1  accccccccccccccccccccccccccccccccccccccc
             ^ s2

              ......

1,000,000  cccccccccccccccccccccccccccccccccccccccb
                                                 ^ s2

1,000,000  cccccccccccccccccccccccccccccccccccccccb
                                                  ^ s3 accepted

Then re-searched with the superset because matched on multi-lines.

1,000,000  cccccccccccccccccccccacccccccccccccccccb
           ^ s0


1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                ^ s1

1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                 ^ s2

1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                                 ^ s2

1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                                  ^ s3 accepted

Then re-searched with the original DFA becuase matched on one-line.
The DFA states are following.

s0: The position set is none. 
s1: The position set is 1:a 
s2: The position set is 1:a 2:ANYCHAR
s3: The position set is 1:a 2:ANYCHAR 3:b (accepted)

1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                ^ s1

1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                 ^ s2

1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                  ^ s0 rejected



Even if matched multi-line, It's still as fast and memory-required as
matched single-line, because in search-mode STAR doesn't count number
of NFA and DFA states up.  So for example, a set of DFA states for
`a CSET STAR b' is correspond with for `a CSET b'. (If my recognition is
right.)

OTOH, If CSET{1,mb_cur_max} is assigned for ANYCHAR, number of DFA
states will be greater than above, and I may be slightly slower
becuase of overheads to build DFA states.  Notice that `{m,n}'
expression requires a lot of memory and is slow (See also
http://www.gnu.org/software/grep/manual/grep.html#Reporting-Bugs).

Example, use the superset for following pattern in EUC-JP locale.
It's mb_cur_max == 3.

    a.....b

If ANYCHAR is replaced to `CSET STAR', as following.

    a CSET STAR CSET STAR CSET STAR CSET STAR CSET STAR CSET STAR b

I draw it below.

        /\
       /  \  2:CSET
       \  /
        \/
       1:a ---> 3:b
(The figure drawn previously was wrong.)

It's equal to following.

    a CSET STAR b

OTOH, the period is replaced to `CSET CSET CSET CAT OR CSET CSET CAT
CSET CAT OR', below.  Itt's complicated, and I don't want to draw it. :)

    a CSET CSET CSET CAT OR CSET CSET CAT CSET CAT OR CSET CSET CSET
    CAT OR CSET CSET CAT CSET CAT OR CAT CSET CSET CSET CAT OR CSET
    CSET CAT CSET CAT OR CAT CSET CSET CSET CAT OR CSET CSET CAT CSET
    CAT OR CAT b

Thanks,
Norihiro





Information forwarded to bug-grep <at> gnu.org:
bug#16966; Package grep. (Wed, 02 Apr 2014 13:53:02 GMT) Full text and rfc822 format available.

Message #54 received at 16966 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paolo Bonzini <bonzini <at> gnu.org>
Cc: 16966 <at> debbugs.gnu.org
Subject: bug#16966: [PATCH] grep: optimization with the superset of DFA
Date: Wed, 02 Apr 2014 22:52:43 +0900
Norihiro Tanaka wrote:
> s0: The position set is none. 
> s1: The position set is 1:a 
> s2: The position set is 1:a 2:CSET
> s3: The position set is 1:a 2:CSET 3:b (accepted)

Sorry, it was wrong.  It should be as follows.

s0: The position set is none.
s1: The position set is 1:a
s2: The position set is 1:a 2:CSET
s3: The position set is 2:CSET 3:b (accepted)
s4: The position set is 2:CSET

        1  accccccccccccccccccccccccccccccccccccccc
           ^ s1

        1  accccccccccccccccccccccccccccccccccccccc
            ^ s4

        1  accccccccccccccccccccccccccccccccccccccc
             ^ s4

              ......

1,000,000  cccccccccccccccccccccccccccccccccccccccb
                                                 ^ s4

1,000,000  cccccccccccccccccccccccccccccccccccccccb
                                                  ^ s3 accepted

Then re-searched with the superset because matched on multi-lines.

1,000,000  cccccccccccccccccccccacccccccccccccccccb
           ^ s0


1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                ^ s1

1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                 ^ s4

1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                                 ^ s4

1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                                  ^ s3 accepted


s0: The position set is none. 
s1: The position set is 1:a 
s2: The position set is 1:a 2:ANYCHAR
s3: The position set is 2:ANYCHAR 3:b (accepted)
s4: The position set is 2:ANYCHAR

1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                ^ s1

1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                 ^ s4

1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                  ^ s0 rejected





Information forwarded to bug-grep <at> gnu.org:
bug#16966; Package grep. (Fri, 04 Apr 2014 03:30:02 GMT) Full text and rfc822 format available.

Message #57 received at 16966 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paolo Bonzini <bonzini <at> gnu.org>, 16966 <at> debbugs.gnu.org
Subject: bug#16966: [PATCH] grep: optimization with the superset of DFA
Date: Fri, 04 Apr 2014 12:28:47 +0900
[Message part 1 (text/plain, inline)]
Norihiro Tanaka wrote:
> s0: The position set is none.
> s1: The position set is 1:a
> s2: The position set is 1:a 2:CSET
> s3: The position set is 2:CSET 3:b (accepted)
> s4: The position set is 2:CSET

Sorry, it was wrong yet.  It should be as follows.

s0: The position set is none.
s1: The position set is 1:a 2:CSET 3:b
s2: The position set is 1:a 2:CSET 3:b (accepted)

First of all, I noticed that the example is wrong, because if fixed
strings are included in the pattern, kwset is used for the pattern,
and dfahint() is called for each single line.

--
By the way, I compared two cases, which are CSET* and
CSET{1,mb_cur_max}.  I used `\(a\|z\)[x-y]\{10\}\(b\|z\)' as the pattern.
`[x-z]' is MBCSET in UTF-8 locale.

If replace MBCSET to CSET*, the pattern in the superset is
`\(a\|z\)\(CSET*\)\{10\}\(b\|z\)'.
If replace MBCSET to CSET{1,mb_cur_max}, the pattern in the superset is
`\(a\|z\)\(CSET\{1,6\}\)\{10\}\(b\|z\)'.

In order to simplify, use `\(a\|z\)\(.\{1,6\}\)\{10\}\(b\|z\)' in
POSIX locale for later.

Before testing, apply the patch attached on this mail, Set `DDEBUG' to
CPPFLAGS, and re-build.  It outputs more debugging information.

  $ yes `printf '%0100d\n'` | head -320 | \
      sed -e '1s/^0/a/; $s/0$/b/' | \
      env LANG=en_US.UTF-8 src/grep '\(a\|z\)[x-y]\{10\}\(b\|z\)' 2>debug1.log
  $ yes `printf '%0100d\n'` | head -320 | \
      sed -e '1s/^0/a/; $s/0$/b/' | \
      env LANG=C src/grep '\(a\|z\)\(.\{1,6\}\)\{10\}\(b\|z\)' 2>debug2.log

(`320' originates in the default size of a buffer.)

The formar is only built 3 states.
OTOH, the later is built 224 dfastates.
i.e. even if matched multi-lines, many states aren't necessarily built. 

Next, I checked performance for each case in worst text.
It's best in 10 times.

yes `printf '%0100d\n'` | head -2 | sed -e '1s/^0/a/; $s/0$/b/' >t
yes t | head -1000000 | xargs cat >u

time -p env LANG=C src/grep '\(a\|z\)\(.\{1,6\}\)\{10\}\(b\|z\)' u
    real 1.26       user 0.95       sys 0.28
time -p env LANG=en_US.UTF-8 src/grep '\(a\|z\)[x-y]\{10\}\(b\|z\)' u
    real 0.74       user 0.45       sys 0.27

Though, the later is fast, I like the former, because it's very simple.
the later is complicated, also needs to update nleaves and depth.

Norihiro
[patch.txt (text/plain, attachment)]

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Mon, 07 Apr 2014 05:10:02 GMT) Full text and rfc822 format available.

Notification sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
bug acknowledged by developer. (Mon, 07 Apr 2014 05:10:08 GMT) Full text and rfc822 format available.

Message #62 received at 16966-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: Paolo Bonzini <bonzini <at> gnu.org>, 16966-done <at> debbugs.gnu.org
Subject: Re: bug#16966: [PATCH] grep: optimization with the superset of DFA
Date: Sun, 06 Apr 2014 22:08:53 -0700
[Message part 1 (text/plain, inline)]
Norihiro Tanaka wrote:
> I rebased this patch, and added four fixes to it.

Thanks.  This patch seems like a real performance win for grep -i in the 
usual case, so I installed it into the master.  If we run into 
performance problems in unusual cases I suppose we can look into them as 
they arise.

In reviewing the patch I found one memory-access typo:

> -              if ((end = memchr(beg, eol, buflim - beg)) != NULL)
> ...
> +              if ((end = memchr(next_beg, eol, buflim - beg)) != NULL)

That last "beg" should be "next_beg".

I merged the patch into the current master on Savannah (first attached 
file) and then applied a fixup patch (second attached file) that fixes 
this bug and makes some other minor related improvements, e.g., using 
memrchr rather than looking for eol by hand.
[0001-grep-optimization-with-the-superset-of-DFA.patch (text/plain, attachment)]
[0002-grep-cleanup-DFA-superset-optimization.patch (text/plain, attachment)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 05 May 2014 11:24:03 GMT) Full text and rfc822 format available.

This bug report was last modified 11 years and 106 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.