GNU bug report logs - #19306
[PATCH 1/2] dfa: avoid execution for a pattern including an unsupported expression

Previous Next

Package: grep;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Mon, 8 Dec 2014 15:26:01 UTC

Severity: normal

Tags: patch

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.

Full log


View this message in rfc822 format

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 19306 <at> debbugs.gnu.org
Subject: bug#19306: [PATCH 1/2] dfa: avoid execution for a pattern including an unsupported expression
Date: Sun, 26 Jul 2015 09:10:05 -0700
[Message part 1 (text/plain, inline)]
On Mon, Jul 20, 2015 at 8:14 AM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
>
> On Sun, 19 Jul 2015 20:14:52 -0700
> Jim Meyering <jim <at> meyering.net> wrote:
>
>> Thank you for the additional information and the test script.
>> I like most of this patch, but not the fact that it causes the
>> word-delim-multibyte test to fail. I do see that also applying your
>> following patch makes that test pass once again.  However, it does so
>> at the cost of forcing a new class of regexps (any that contain a use
>> of \b, \<  or \>) from DFA into the slower regex matcher.
>
> I think DFA forces regex for BEGWORD, LIMWORD, ENDWORD, instead of
> whether patching or not.  Could you remark code in dfassbuild() without
> patching?  It seem that DFA rejects their words from before.
>
>         case BEGWORD:
>         case ENDWORD:
>         case LIMWORD:
>         case NOTLIMWORD:
>           if (d->multibyte)
>             {
>               /* These constraints aren't supported in a multibyte locale.
>                  Ignore them in the superset DFA, and treat them as
>                  backreferences in the main DFA.  */
>               sup->tokens[j++] = EMPTY;
>               d->tokens[i] = BACKREF;  <<<<
>               break;
>             }
>
> DFA does not handle word context in multibyte correctly.  Perhaps, if we
> fix it, DFA will take a performance penalty.

You're right. That covers it.

This patch is good also for eliminating some technical debt.
Since your change eliminates the code below (effectively making
the same change from conjunct to disjunct), there is no reason
to make the following correction:

               /* Falling back to the glibc matcher in this case gives   \
                  better performance (up to 25% better on [a-z], for     \
                  example) and enables support for collating symbols and \
                  equivalence classes.  */                               \
-              if (d->states[s].has_mbcset && backref)                   \
+              if (d->states[s].has_mbcset || backref)                   \
                 {                                                       \
                   *backref = 1;                                         \
                   goto done;                                            \
                 }                                                       \

At first I was surprised to see that using "&&" there provoked
no test failure, but then saw that we test "backref" shortly thereafter.
While technically, using "&&" there is a bug, it seems the effect was
negligible.

I have made some changes, renaming functions, e.g., dfabackref -> dfa_supported,
since even before this change "backref" meant more than the presence
of a backreference.

Note that while your commit log entry described new functions, I have
modified the commit
log to say merely "new function" for each. Instead, I document the new
functions in the code,
where that documentation will be more useful, and maintained.

Please let me know if there is anything you'd like to change:
[0001-dfa-avoid-execution-for-a-pattern-including-an-unsup.patch (text/x-patch, attachment)]
[0002-dfa-remove-word-delimiter-support-for-multibyte-loca.patch (text/x-patch, attachment)]

This bug report was last modified 9 years and 363 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.