On Mon, Jul 20, 2015 at 8:14 AM, Norihiro Tanaka wrote: > > On Sun, 19 Jul 2015 20:14:52 -0700 > Jim Meyering wrote: > >> Thank you for the additional information and the test script. >> I like most of this patch, but not the fact that it causes the >> word-delim-multibyte test to fail. I do see that also applying your >> following patch makes that test pass once again. However, it does so >> at the cost of forcing a new class of regexps (any that contain a use >> of \b, \< or \>) from DFA into the slower regex matcher. > > I think DFA forces regex for BEGWORD, LIMWORD, ENDWORD, instead of > whether patching or not. Could you remark code in dfassbuild() without > patching? It seem that DFA rejects their words from before. > > case BEGWORD: > case ENDWORD: > case LIMWORD: > case NOTLIMWORD: > if (d->multibyte) > { > /* These constraints aren't supported in a multibyte locale. > Ignore them in the superset DFA, and treat them as > backreferences in the main DFA. */ > sup->tokens[j++] = EMPTY; > d->tokens[i] = BACKREF; <<<< > break; > } > > DFA does not handle word context in multibyte correctly. Perhaps, if we > fix it, DFA will take a performance penalty. You're right. That covers it. This patch is good also for eliminating some technical debt. Since your change eliminates the code below (effectively making the same change from conjunct to disjunct), there is no reason to make the following correction: /* Falling back to the glibc matcher in this case gives \ better performance (up to 25% better on [a-z], for \ example) and enables support for collating symbols and \ equivalence classes. */ \ - if (d->states[s].has_mbcset && backref) \ + if (d->states[s].has_mbcset || backref) \ { \ *backref = 1; \ goto done; \ } \ At first I was surprised to see that using "&&" there provoked no test failure, but then saw that we test "backref" shortly thereafter. While technically, using "&&" there is a bug, it seems the effect was negligible. I have made some changes, renaming functions, e.g., dfabackref -> dfa_supported, since even before this change "backref" meant more than the presence of a backreference. Note that while your commit log entry described new functions, I have modified the commit log to say merely "new function" for each. Instead, I document the new functions in the code, where that documentation will be more useful, and maintained. Please let me know if there is anything you'd like to change: