GNU bug report logs - #17027
[PATCH] grep: prefer regex to DFA for ANYCHAR in non-UTF8 locales

Previous Next

Package: grep;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Mon, 17 Mar 2014 15:02:01 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 17027 in the body.
You can then email your comments to 17027 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-grep <at> gnu.org:
bug#17027; Package grep. (Mon, 17 Mar 2014 15:02:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Mon, 17 Mar 2014 15:02:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: submit <at> debbugs.gnu.org
Subject: [PATCH] grep: prefer regex to DFA for ANYCHAR in non-UTF8 locales
Date: Tue, 18 Mar 2014 00:01:05 +0900
[Message part 1 (text/plain, inline)]
Package: grep
Tags: patch

When ANYCHAR is included in a pattern in non-UTF8 locales, grep prefer
to DFA engine to regex's.  However, as long as I tested, even after have
applied Patch#17025, regex engine is slower than DFA's for ANYCHAR in
non-UTF8 locales.

This patch prefers regex to DFA for ANYCHAR in non-UTF8 locales.

Create the text.

$ yes abcd.abc | head -1000000 > m

I tested below before applying it.

$ time -p env LC_ALL=ja_JP.eucJP src/grep abcd.abd m
real 1.99
user 1.75
sys 0.28

I re-tested after applying it.

$ time -p env LC_ALL=ja_JP.eucJP src/grep abcd.abd m
real 1.21
user 0.71
sys 0.46

Norihiro
[patch2.txt (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#17027; Package grep. (Tue, 01 Apr 2014 08:53:01 GMT) Full text and rfc822 format available.

Message #8 received at 17027 <at> debbugs.gnu.org (full text, mbox):

From: Paolo Bonzini <bonzini <at> gnu.org>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, 17027 <at> debbugs.gnu.org
Subject: Re: bug#17027: [PATCH] grep: prefer regex to DFA for ANYCHAR in
 non-UTF8 locales
Date: Tue, 01 Apr 2014 10:51:59 +0200
Il 17/03/2014 16:01, Norihiro Tanaka ha scritto:
> Package: grep
> Tags: patch
> 
> When ANYCHAR is included in a pattern in non-UTF8 locales, grep prefer
> to DFA engine to regex's.  However, as long as I tested, even after have
> applied Patch#17025, regex engine is slower than DFA's for ANYCHAR in
> non-UTF8 locales.
> 
> This patch prefers regex to DFA for ANYCHAR in non-UTF8 locales.
> 
> Create the text.
> 
> $ yes abcd.abc | head -1000000 > m
> 
> I tested below before applying it.
> 
> $ time -p env LC_ALL=ja_JP.eucJP src/grep abcd.abd m
> real 1.99
> user 1.75
> sys 0.28
> 
> I re-tested after applying it.
> 
> $ time -p env LC_ALL=ja_JP.eucJP src/grep abcd.abd m
> real 1.21
> user 0.71
> sys 0.46
> 
> Norihiro
> 

Hi Norihiro,

what about something like this instead (untested)?

Paolo

diff --git a/src/dfa.c b/src/dfa.c
index c06c922..f756194 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -299,6 +299,7 @@ typedef struct
   position_set elems;           /* Positions this state could match.  */
   unsigned char context;        /* Context from previous state.  */
   char backref;                 /* True if this state matches a \<digit>.  */
+  bool has_mbcset;              /* True if this state matches a MBCSET.  */
   unsigned short constraint;    /* Constraint for this state to accept.  */
   token first_end;              /* Token value of the first END in elems.  */
   position_set mbps;            /* Positions which can match multibyte
@@ -2645,6 +2646,7 @@ dfastate (state_num s, struct dfa *d, state_num trans[])
           if (d->states[s].mbps.nelem == 0)
             alloc_position_set (&d->states[s].mbps, 1);
           insert (pos, &(d->states[s].mbps));
+          d->states[s].has_mbcset |= (d->tokens[pos.index] == MBCSET);
           continue;
         }
       else
@@ -3450,7 +3452,7 @@ dfaexec (struct dfa *d, char const *begin, char *end,
                  better performance (up to 25% better on [a-z], for
                  example) and enables support for collating symbols and
                  equivalence classes.  */
-              if (backref)
+              if (d->states[s].has_mbcset && backref)
                 {
                   *backref = 1;
                   free (mblen_buf);





Information forwarded to bug-grep <at> gnu.org:
bug#17027; Package grep. (Tue, 01 Apr 2014 14:52:01 GMT) Full text and rfc822 format available.

Message #11 received at 17027 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paolo Bonzini <bonzini <at> gnu.org>
Cc: 17027 <at> debbugs.gnu.org
Subject: bug#17027: [PATCH] grep: prefer regex to DFA for ANYCHAR in non-UTF8
 locales
Date: Tue, 01 Apr 2014 23:51:29 +0900
[Message part 1 (text/plain, inline)]
Hi Paolo,

I applied same type and naming to member `backref' of dfastate.
And I checked to pass regression tests.

Thanks,
Norihiro
[patch.txt (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#17027; Package grep. (Wed, 02 Apr 2014 23:28:02 GMT) Full text and rfc822 format available.

Message #14 received at 17027 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: 17027 <at> debbugs.gnu.org
Subject: bug#17027: [PATCH] grep: prefer regex to DFA for ANYCHAR in non-UTF8
 locales
Date: Thu, 03 Apr 2014 08:27:17 +0900
[Message part 1 (text/plain, inline)]
I changed the type of `has_backref' into `bool'.

Norihiro
[patch.txt (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#17027; Package grep. (Thu, 03 Apr 2014 15:20:01 GMT) Full text and rfc822 format available.

Message #17 received at 17027 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: 17027 <at> debbugs.gnu.org
Subject: bug#17027: [PATCH] grep: prefer regex to DFA for ANYCHAR in non-UTF8
 locales
Date: Fri, 04 Apr 2014 00:19:02 +0900
[Message part 1 (text/plain, inline)]
We need to intialize the new member.
I add it to the patch.
[patch.txt (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#17027; Package grep. (Mon, 07 Apr 2014 12:09:01 GMT) Full text and rfc822 format available.

Message #20 received at 17027 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 17027 <at> debbugs.gnu.org
Subject: Re: bug#17027: [PATCH] grep: prefer regex to DFA for ANYCHAR in
 non-UTF8 locales
Date: Mon, 07 Apr 2014 21:07:58 +0900
[Message part 1 (text/plain, inline)]
Norihiro Tanaka wrote:
> We need to intialize the new member.
> I add it to the patch.

I haven't added it yet.  I have done now.
[patch.txt (text/plain, attachment)]

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Tue, 08 Apr 2014 04:09:02 GMT) Full text and rfc822 format available.

Notification sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
bug acknowledged by developer. (Tue, 08 Apr 2014 04:09:02 GMT) Full text and rfc822 format available.

Message #25 received at 17027-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 17027-done <at> debbugs.gnu.org
Subject: Re: bug#17027: [PATCH] grep: prefer regex to DFA for ANYCHAR in
 non-UTF8 locales
Date: Mon, 07 Apr 2014 21:08:44 -0700
Thanks for this patch too.  I pushed it into the savannah git master, 
with a slightly different commit message.




Information forwarded to bug-grep <at> gnu.org:
bug#17027; Package grep. (Tue, 08 Apr 2014 06:13:02 GMT) Full text and rfc822 format available.

Message #28 received at 17027 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 17027 <at> debbugs.gnu.org
Subject: Re: bug#17027: [PATCH] grep: prefer regex to DFA for ANYCHAR in
 non-UTF8 locales
Date: Mon, 07 Apr 2014 23:12:07 -0700
[Message part 1 (text/plain, inline)]
As the patch went partway in using bool in dfa.c, I took the liberty of 
converting the rest of the DFA internals to use bool when that seemed to 
make the code clearer and/or the executable smaller.  I didn't change 
the external API.  I installed the attached.
[0001-grep-prefer-bool-in-DFA-internals.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#17027; Package grep. (Tue, 08 Apr 2014 15:05:02 GMT) Full text and rfc822 format available.

Message #31 received at 17027 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 17027 <at> debbugs.gnu.org
Subject: Re: bug#17027: [PATCH] grep: prefer regex to DFA for ANYCHAR in
 non-UTF8 locales
Date: Wed, 09 Apr 2014 00:04:00 +0900
Paul, thanks for the review and installation for my many patches.
I'm sure that it's much faster than grep-2.18 in many cases.





Information forwarded to bug-grep <at> gnu.org:
bug#17027; Package grep. (Tue, 08 Apr 2014 20:20:02 GMT) Full text and rfc822 format available.

Message #34 received at 17027 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: 17027 <at> debbugs.gnu.org
Subject: Re: bug#17027: [PATCH] grep: prefer regex to DFA for ANYCHAR in
 non-UTF8 locales
Date: Tue, 08 Apr 2014 13:19:03 -0700
[Message part 1 (text/plain, inline)]
Aharon Robbins privately pointed out that the bool_bf business wasn't 
worth the hassle, so I installed the attached patch to simplify.  
Thanks, Aharon!

[0001-grep-remove-bool_bf.patch (text/x-patch, attachment)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 07 May 2014 11:24:05 GMT) Full text and rfc822 format available.

This bug report was last modified 11 years and 124 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.