GNU bug report logs - #33249
[PATCH] grep: grouping of patterns including back reference

Previous Next

Package: grep;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Sat, 3 Nov 2018 11:43:01 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Full log


View this message in rfc822 format

From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: tracker <at> debbugs.gnu.org
Subject: bug#33249: closed ([PATCH] grep: grouping of patterns including
 back reference)
Date: Mon, 23 Dec 2019 00:58:01 +0000
[Message part 1 (text/plain, inline)]
Your message dated Sun, 22 Dec 2019 16:57:12 -0800
with message-id <49e7c978-7da6-fc38-1285-7c7c3e6eb50a <at> cs.ucla.edu>
and subject line Re: bug#33249: [PATCH] grep: grouping of patterns including back reference
has caused the debbugs.gnu.org bug report #33249,
regarding [PATCH] grep: grouping of patterns including back reference
to be marked as done.

(If you believe you have received this mail in error, please contact
help-debbugs <at> gnu.org.)


-- 
33249: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=33249
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: <bug-grep <at> gnu.org>
Subject: [PATCH] grep: grouping of patterns including back reference
Date: Sat, 03 Nov 2018 20:41:46 +0900
[Message part 3 (text/plain, inline)]
Hi,

When grep uses regex, it splits a pattern with multiple lines by
newline character.  Compilation and executution run for each fragment. 
That causes slowdown.  By this change, each fragment is divided into
groups by whether the fragment includes back reference in a pattern or
not. a frgment which includes back reference constitutes group, and all
frgments which include back reference also constitute a group.

This change extremely speeds-up following case.

- before

    $ yes 00000000000000000000000000000000000000000x | head -10 >in
    $ time -p env LC_ALL=C src/grep -f pat in
    real 1.94
    user 1.70
    sys 0.24

    $ yes 00000000000000000000000000000000000000000x | head -100 >in
    $ time -p env LC_ALL=C src/grep -f pat in
    real 13.04
    user 12.78
    sys 0.25

- after

    $ yes 00000000000000000000000000000000000000000x | head -10 >in
    $ time -p env LC_ALL=C src/grep -f pat in
    real 0.75
    user 0.56
    sys 0.17

    $ yes 00000000000000000000000000000000000000000x | head -100 >in
    $ time -p env LC_ALL=C src/grep -f pat in
    real 0.74
    user 0.57
    sys 0.16

Thanks,
Norihiro
[0001-grep-grouping-of-patterns-including-back-reference.patch (text/plain, attachment)]
[Message part 5 (message/rfc822, inline)]
From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 33249-done <at> debbugs.gnu.org
Subject: Re: bug#33249: [PATCH] grep: grouping of patterns including back
 reference
Date: Sun, 22 Dec 2019 16:57:12 -0800
[Message part 6 (text/plain, inline)]
On 11/3/18 9:25 PM, Norihiro Tanaka wrote:

>   $ seq -f '%040g' 0 9999 | sed '1s/$/\\(0\\)\\1/' >pat

Thanks for the test case and sorry about the delay. And thanks for spotting the
speedup opportunity. I found a few problems with the proposed patch, though:

> +        if (keys[1] == '\\')
> +          keys++;

This mishandles the case where the input byte sequence contains '\', '\', '1'
where the first '\' is the last byte of a multibyte character. Such a byte
sequence can contain backreferences but the function will say it doesn't.

> +      if (backref && prev < p)
> +        {
> +          len = p - prev;
> +          buf = xrealloc (buf, (buflen + len) * sizeof *buf);
> +          memcpy (buf + buflen, p, len);
> +          buflen += len;
> +        }

This seems to have three problems. First, the memcpy copies from P, but it
should copy from PREV. Second, this code assigns to LEN, which breaks a later
use of LEN. Third, if there are many patterns with backreferences, repeated use
of realloc will have O(N**2) behavior.

> +      for (size_t i = 0; i < dc->pcount; i++)
> +        dc->patterns[i + 1] = dc->patterns[i];

This copies dc->patterns[0] to the later values in that array, when a memmove
was intended.

So, after installing the patch, I immediately installed the attached patch,
which should address the abovementioned issues.

Thanks again. You did the hard work - I merely proofread it.
[0001-grep-fix-some-bugs-in-pattern-grouping-speedup.patch (text/x-patch, attachment)]

This bug report was last modified 5 years and 236 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.