GNU bug report logs - #17027
[PATCH] grep: prefer regex to DFA for ANYCHAR in non-UTF8 locales
Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Date: Mon, 17 Mar 2014 15:02:01 UTC
Severity: normal
Tags: patch
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
Message #8 received at 17027 <at> debbugs.gnu.org (full text, mbox):
On 17/03/2014 16:01, Norihiro Tanaka wrote:
> Package: grep
> Tags: patch
>
> When ANYCHAR is included in a pattern in non-UTF8 locales, grep prefers
> the DFA engine to regex. However, as far as I have tested, even after
> applying patch #17025, the DFA engine is slower than regex for ANYCHAR
> in non-UTF8 locales.
>
> This patch prefers regex to DFA for ANYCHAR in non-UTF8 locales.
>
> Create the test file.
>
> $ yes abcd.abc | head -1000000 > m
>
> I ran the test below before applying the patch.
>
> $ time -p env LC_ALL=ja_JP.eucJP src/grep abcd.abd m
> real 1.99
> user 1.75
> sys 0.28
>
> I re-ran it after applying the patch.
>
> $ time -p env LC_ALL=ja_JP.eucJP src/grep abcd.abd m
> real 1.21
> user 0.71
> sys 0.46
>
> Norihiro
>
Hi Norihiro,
what about something like this instead (untested)?
Paolo
diff --git a/src/dfa.c b/src/dfa.c
index c06c922..f756194 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -299,6 +299,7 @@ typedef struct
   position_set elems;           /* Positions this state could match. */
   unsigned char context;        /* Context from previous state. */
   char backref;                 /* True if this state matches a \<digit>. */
+  bool has_mbcset;              /* True if this state matches a MBCSET. */
   unsigned short constraint;    /* Constraint for this state to accept. */
   token first_end;              /* Token value of the first END in elems. */
   position_set mbps;            /* Positions which can match multibyte
@@ -2645,6 +2646,7 @@ dfastate (state_num s, struct dfa *d, state_num trans[])
               if (d->states[s].mbps.nelem == 0)
                 alloc_position_set (&d->states[s].mbps, 1);
               insert (pos, &(d->states[s].mbps));
+              d->states[s].has_mbcset |= (d->tokens[pos.index] == MBCSET);
               continue;
             }
           else
@@ -3450,7 +3452,7 @@ dfaexec (struct dfa *d, char const *begin, char *end,
                  better performance (up to 25% better on [a-z], for
                  example) and enables support for collating symbols and
                  equivalence classes. */
-              if (backref)
+              if (d->states[s].has_mbcset && backref)
                 {
                   *backref = 1;
                   free (mblen_buf);
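
For readers following the idea rather than the exact hunks, here is a minimal standalone sketch of what the diff above appears to do: remember per state whether any recorded multibyte position is an MBCSET, and hand the match over to regex only in that case. Everything in it (the toy_ names, the reduced struct, the three-value token enum) is hypothetical and exists only for illustration; it is not code from dfa.c, and like the patch itself it makes no claim about performance.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical token kinds; the real dfa.c defines many more.  */
enum toy_token { TOY_CHAR, TOY_ANYCHAR, TOY_MBCSET };

/* Toy stand-in for a DFA state, reduced to the fields this sketch needs.  */
struct toy_state
{
  size_t n_mb_positions;   /* Number of multibyte positions recorded.  */
  bool has_mbcset;         /* True once any recorded position is an MBCSET.  */
};

/* Record a multibyte position and note whether it is an MBCSET.
   This mirrors the "has_mbcset |= ..." line in the dfastate hunk.  */
static void
toy_add_mb_position (struct toy_state *st, enum toy_token tok)
{
  assert (st != NULL);
  st->n_mb_positions++;
  st->has_mbcset |= (tok == TOY_MBCSET);
}

/* Return true if the matcher should give up and defer to regex.
   This mirrors the changed condition in the dfaexec hunk: fall back
   only when the state holds an MBCSET and the caller passed a
   backref flag to set.  */
static bool
toy_must_fall_back (const struct toy_state *st, int *backref)
{
  assert (st != NULL);
  if (st->has_mbcset && backref)
    {
      *backref = 1;
      return true;
    }
  return false;
}
```

In this toy model, a state that has only seen TOY_ANYCHAR never triggers the fallback, which corresponds to the case the original report benchmarks (a pattern such as abcd.abd in a non-UTF8 multibyte locale).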
This bug report was last modified 11 years and 125 days ago.