GNU bug report logs - #17027
[PATCH] grep: prefer regex to DFA for ANYCHAR in non-UTF8 locales


Package: grep

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Mon, 17 Mar 2014 15:02:01 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.



Message #8 received at 17027 <at> debbugs.gnu.org (full text, mbox):

From: Paolo Bonzini <bonzini <at> gnu.org>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, 17027 <at> debbugs.gnu.org
Subject: Re: bug#17027: [PATCH] grep: prefer regex to DFA for ANYCHAR in
 non-UTF8 locales
Date: Tue, 01 Apr 2014 10:51:59 +0200
On 17/03/2014 16:01, Norihiro Tanaka wrote:
> Package: grep
> Tags: patch
> 
> When a pattern contains ANYCHAR in a non-UTF8 locale, grep prefers the
> DFA engine to the regex engine.  However, as far as I have tested, even
> after applying Patch#17025, the DFA engine is slower than regex for
> ANYCHAR in non-UTF8 locales.
> 
> This patch prefers regex to DFA for ANYCHAR in non-UTF8 locales.
> 
> Create the test file:
> 
> $ yes abcd.abc | head -1000000 > m
> 
> Timing before applying the patch:
> 
> $ time -p env LC_ALL=ja_JP.eucJP src/grep abcd.abd m
> real 1.99
> user 1.75
> sys 0.28
> 
> Timing after applying the patch:
> 
> $ time -p env LC_ALL=ja_JP.eucJP src/grep abcd.abd m
> real 1.21
> user 0.71
> sys 0.46
> 
> Norihiro
> 

Hi Norihiro,

what about something like this instead (untested)?

Paolo

diff --git a/src/dfa.c b/src/dfa.c
index c06c922..f756194 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -299,6 +299,7 @@ typedef struct
   position_set elems;           /* Positions this state could match.  */
   unsigned char context;        /* Context from previous state.  */
   char backref;                 /* True if this state matches a \<digit>.  */
+  bool has_mbcset;              /* True if this state matches a MBCSET.  */
   unsigned short constraint;    /* Constraint for this state to accept.  */
   token first_end;              /* Token value of the first END in elems.  */
   position_set mbps;            /* Positions which can match multibyte
@@ -2645,6 +2646,7 @@ dfastate (state_num s, struct dfa *d, state_num trans[])
           if (d->states[s].mbps.nelem == 0)
             alloc_position_set (&d->states[s].mbps, 1);
           insert (pos, &(d->states[s].mbps));
+          d->states[s].has_mbcset |= (d->tokens[pos.index] == MBCSET);
           continue;
         }
       else
@@ -3450,7 +3452,7 @@ dfaexec (struct dfa *d, char const *begin, char *end,
                  better performance (up to 25% better on [a-z], for
                  example) and enables support for collating symbols and
                  equivalence classes.  */
-              if (backref)
+              if (d->states[s].has_mbcset && backref)
                 {
                   *backref = 1;
                   free (mblen_buf);
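
The idea in the diff above can be illustrated outside dfa.c with a reduced sketch: cache, per state, whether any multibyte position is an MBCSET, and fall back to the regex engine only when that flag is set. Everything named toy_* below is hypothetical and exists only for illustration; the real structures in dfa.c carry far more data, and caller_allows_fallback is an assumed stand-in for the non-null backref check in dfaexec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical token kinds; the real dfa.c defines many more.  */
enum toy_token { TOY_CHAR, TOY_ANYCHAR, TOY_MBCSET };

/* Hypothetical, much-reduced analogue of a DFA state.  */
struct toy_state
{
  const enum toy_token *mb_tokens; /* Tokens at the multibyte positions.  */
  size_t n_mb_tokens;              /* Number of entries in mb_tokens.  */
  bool has_mbcset;                 /* Cached: some position is TOY_MBCSET.  */
};

/* Record the multibyte positions of a state and compute the cached
   flag once, so the matching loop does not rescan the positions.  */
static void
toy_state_init (struct toy_state *st, const enum toy_token *tokens, size_t n)
{
  assert (st != NULL && (tokens != NULL || n == 0));
  st->mb_tokens = tokens;
  st->n_mb_tokens = n;
  st->has_mbcset = false;
  for (size_t i = 0; i < n; i++)
    st->has_mbcset |= (tokens[i] == TOY_MBCSET);
}

/* Return true if the matcher should stop in this state and hand the
   input to the regex engine.  ANYCHAR alone does not force a fallback;
   only a state containing an MBCSET does, and only when the caller
   accepts a fallback.  */
static bool
toy_needs_regex (const struct toy_state *st, bool caller_allows_fallback)
{
  return st->has_mbcset && caller_allows_fallback;
}
```

With this shape, a state built only from ANYCHAR positions stays in the DFA matcher, while a state that includes a bracket expression such as [a-z] in a multibyte locale still defers to regex.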





This bug report was last modified 11 years and 125 days ago.
