GNU bug report logs - #18762
[PATCH] dfa: don't consider RE_DOT_NEWLINE and RE_DOT_NOT_NULL in matching with a bracket expression

Previous Next

Package: grep;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Sat, 18 Oct 2014 12:41:03 UTC

Severity: normal

Tags: patch

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 18762 in the body.
You can then email your comments to 18762 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-grep <at> gnu.org:
bug#18762; Package grep. (Sat, 18 Oct 2014 12:41:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sat, 18 Oct 2014 12:41:04 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: bug-grep <at> gnu.org
Subject: [PATCH] dfa: don't consider RE_DOT_NEWLINE and RE_DOT_NOT_NULL in
 matching with a bracket expression
Date: Sat, 18 Oct 2014 21:39:37 +0900
[Message part 1 (text/plain, inline)]
RE_DOT_NEW_LINE and NOT_NULL work for '.' only in regex.  OTOH, they
work for MBCSET in addition to '.' in DFA.  This patch adapts the behavior
of DFA to of regex.

BTW, at the moment, grep and gawk never use match_mb_charset function to
be fixed by it.
[0001-dfa-don-t-consider-RE_DOT_NEWLINE-and-RE_DOT_NOT_NUL.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#18762; Package grep. (Sat, 18 Oct 2014 17:07:02 GMT) Full text and rfc822 format available.

Message #8 received at 18762 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 18762 <at> debbugs.gnu.org
Subject: Re: bug#18762: [PATCH] dfa: don't consider RE_DOT_NEWLINE and
 RE_DOT_NOT_NULL in matching with a bracket expression
Date: Sat, 18 Oct 2014 10:06:33 -0700
On Sat, Oct 18, 2014 at 5:39 AM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> RE_DOT_NEW_LINE and NOT_NULL work for '.' only in regex.  OTOH, they
> work for MBCSET in addition to '.' in DFA.  This patch adapts the behavior
> of DFA to of regex.
>
> BTW, at the moment, grep and gawk never use match_mb_charset function to
> be fixed by it.

Thank you for the patch. It is clearly correct.
However, it presents a puzzle: does your patch induce any semantic
change in grep?
I.e., is this a bug fix, or simply the removal of code that would have
no effect.
So far, I have been unable to construct a case for which it induces a
semantic change.

On the other hand, this does eliminate a few comparisons,
so there may be a small performance improvement.




Information forwarded to bug-grep <at> gnu.org:
bug#18762; Package grep. (Sat, 18 Oct 2014 23:31:02 GMT) Full text and rfc822 format available.

Message #11 received at 18762 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Jim Meyering <jim <at> meyering.net>
Cc: 18762 <at> debbugs.gnu.org
Subject: Re: bug#18762: [PATCH] dfa: don't consider RE_DOT_NEWLINE and
 RE_DOT_NOT_NULL in matching with a bracket expression
Date: Sun, 19 Oct 2014 08:30:06 +0900
Thanks for the review.

This is a potential bug fix.  However, match_mb_charset function isn't
used in grep and gawk, as DFA treats MBCSET as BACKREF by following code
if `backref' is provided.  Therefore the fix never induces any semantic
change in grep and gawk.

              if (d->states[s].has_mbcset && backref)
                {
                  *backref = 1;
                  goto done;
                }

Essentially, the function is able to be removed.  However, if we regard
DFA as a library,  we should keep it.





Information forwarded to bug-grep <at> gnu.org:
bug#18762; Package grep. (Sun, 19 Oct 2014 00:17:01 GMT) Full text and rfc822 format available.

Message #14 received at 18762 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 18762 <at> debbugs.gnu.org
Subject: Re: bug#18762: [PATCH] dfa: don't consider RE_DOT_NEWLINE and
 RE_DOT_NOT_NULL in matching with a bracket expression
Date: Sat, 18 Oct 2014 17:16:13 -0700
On Sat, Oct 18, 2014 at 4:30 PM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> Thanks for the review.
>
> This is a potential bug fix.  However, match_mb_charset function isn't
> used in grep and gawk, as DFA treats MBCSET as BACKREF by following code
> if `backref' is provided.

dfa.c's match_mb_charset function *is* used, e.g., in a
command like this one:

  printf '\0' |src/grep -aE '^\s?$'

However, as I mentioned, so far I have been unable to
construct a combination of syntax_bits settings and input/RE pairs
that induces a change in behavior.

>  Therefore the fix never induces any semantic
> change in grep and gawk.
>
>               if (d->states[s].has_mbcset && backref)
>                 {
>                   *backref = 1;
>                   goto done;
>                 }
>
> Essentially, the function is able to be removed.  However, if we regard
> DFA as a library,  we should keep it.




Information forwarded to bug-grep <at> gnu.org:
bug#18762; Package grep. (Sun, 19 Oct 2014 02:09:02 GMT) Full text and rfc822 format available.

Message #17 received at 18762 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Jim Meyering <jim <at> meyering.net>
Cc: 18762 <at> debbugs.gnu.org
Subject: Re: bug#18762: [PATCH] dfa: don't consider RE_DOT_NEWLINE and
 RE_DOT_NOT_NULL in matching with a bracket expression
Date: Sun, 19 Oct 2014 11:07:52 +0900
[Message part 1 (text/plain, inline)]
Jim Meyering <jim <at> meyering.net> wrote:
> dfa.c's match_mb_charset function *is* used, e.g., in a
> command like this one:
> 
>   printf '\0' |src/grep -aE '^\s?$'

Wow, just it isn't good.  I think that behavior of `fails' should be
same as of `trans' except `fails' checks accepted conditions, including
following part.  match_mb_charset() should be avoided as far as possible,
as it doesn't support collating symbols and equivalence classes.

>               /* Falling back to the glibc matcher in this case gives
>                  better performance (up to 25% better on [a-z], for
>                  example) and enables support for collating symbols and
>                  equivalence classes.  */
>               if (d->states[s].has_mbcset && backref)
>                 {
>                   *backref = 1;
>                   goto done;
>                 }

[0001-dfa-fall-MBCSET-back-to-the-glibc-matcher-in-transit.patch (text/plain, attachment)]

Reply sent to Jim Meyering <jim <at> meyering.net>:
You have taken responsibility. (Mon, 20 Oct 2014 01:26:01 GMT) Full text and rfc822 format available.

Notification sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
bug acknowledged by developer. (Mon, 20 Oct 2014 01:26:02 GMT) Full text and rfc822 format available.

Message #22 received at 18762-done <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 18762-done <at> debbugs.gnu.org
Subject: Re: bug#18762: [PATCH] dfa: don't consider RE_DOT_NEWLINE and
 RE_DOT_NOT_NULL in matching with a bracket expression
Date: Sun, 19 Oct 2014 18:24:56 -0700
[Message part 1 (text/plain, inline)]
On Sat, Oct 18, 2014 at 7:07 PM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> Jim Meyering <jim <at> meyering.net> wrote:
>> dfa.c's match_mb_charset function *is* used, e.g., in a
>> command like this one:
>>
>>   printf '\0' |src/grep -aE '^\s?$'
>
> Wow, just it isn't good.  I think that behavior of `fails' should be
> same as of `trans' except `fails' checks accepted conditions, including
> following part.  match_mb_charset() should be avoided as far as possible,
> as it doesn't support collating symbols and equivalence classes.
>
>>               /* Falling back to the glibc matcher in this case gives
>>                  better performance (up to 25% better on [a-z], for
>>                  example) and enables support for collating symbols and
>>                  equivalence classes.  */
>>               if (d->states[s].has_mbcset && backref)
>>                 {
>>                   *backref = 1;
>>                   goto done;
>>                 }

Nice change.  I've adjusted the commit log and added the test
above, since no other code even excercised the
now-inaccessible function. I will push it tomorrow.
[0001-dfa-process-all-MBCSET-constructs-via-glibc-s-matche.patch (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#18762; Package grep. (Mon, 20 Oct 2014 03:17:02 GMT) Full text and rfc822 format available.

Message #25 received at 18762-done <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 18762-done <at> debbugs.gnu.org
Subject: Re: bug#18762: [PATCH] dfa: don't consider RE_DOT_NEWLINE and
 RE_DOT_NOT_NULL in matching with a bracket expression
Date: Sun, 19 Oct 2014 20:16:09 -0700
[Message part 1 (text/plain, inline)]
On Sun, Oct 19, 2014 at 6:24 PM, Jim Meyering <jim <at> meyering.net> wrote:
> On Sat, Oct 18, 2014 at 7:07 PM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
>> Jim Meyering <jim <at> meyering.net> wrote:
>>> dfa.c's match_mb_charset function *is* used, e.g., in a
>>> command like this one:
>>>
>>>   printf '\0' |src/grep -aE '^\s?$'
>>
>> Wow, just it isn't good.  I think that behavior of `fails' should be
>> same as of `trans' except `fails' checks accepted conditions, including
>> following part.  match_mb_charset() should be avoided as far as possible,
>> as it doesn't support collating symbols and equivalence classes.
>>
>>>               /* Falling back to the glibc matcher in this case gives
>>>                  better performance (up to 25% better on [a-z], for
>>>                  example) and enables support for collating symbols and
>>>                  equivalence classes.  */
>>>               if (d->states[s].has_mbcset && backref)
>>>                 {
>>>                   *backref = 1;
>>>                   goto done;
>>>                 }
>
> Nice change.  I've adjusted the commit log and added the test
> above, since no other code even excercised the
> now-inaccessible function. I will push it tomorrow.

By the way, I've also adjusted your preceding patch (see attached),
and will push it tomorrow, too.
[0001-dfa-remove-two-erroneous-clauses-from-a-now-unused-f.patch (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#18762; Package grep. (Mon, 20 Oct 2014 14:16:01 GMT) Full text and rfc822 format available.

Message #28 received at 18762-done <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Jim Meyering <jim <at> meyering.net>
Cc: 18762-done <at> debbugs.gnu.org
Subject: Re: bug#18762: [PATCH] dfa: don't consider RE_DOT_NEWLINE and
 RE_DOT_NOT_NULL in matching with a bracket expression
Date: Mon, 20 Oct 2014 23:15:36 +0900
Jim Meyering <jim <at> meyering.net> wrote:
> > Nice change.  I've adjusted the commit log and added the test
> > above, since no other code even excercised the
> > now-inaccessible function. I will push it tomorrow.
> 
> By the way, I've also adjusted your preceding patch (see attached),
> and will push it tomorrow, too.

Thanks for the adjustments and addition of the test.





bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 18 Nov 2014 12:24:05 GMT) Full text and rfc822 format available.

This bug report was last modified 10 years and 276 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.