GNU bug report logs -
#43527
[PATCH] grep: avoid unneeded compilation of regex
Previous Next
Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Date: Sun, 20 Sep 2020 07:17:02 UTC
Severity: normal
Tags: patch
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
Full log
View this message in rfc822 format
[Message part 1 (text/plain, inline)]
On Sun, Sep 20, 2020 at 6:34 PM Jim Meyering <jim <at> meyering.net> wrote:
>
> On Sun, Sep 20, 2020 at 12:17 AM Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> > Hi,
> > Performace for as following case is fixed in bug#43040.
> >
> > $ yes 0 | head -100000 | sed '$s/././' >pat
> > $ grep -vf pat /dev/null
> >
> > However, still slow and a lot of memory wasted for the following cases.
> >
> > $ grep -vf /usr/share/dict/linux.words /usr/share/dict/linux.words
> >
> > This bug is introduced in commit abb7f4f2325f26f930ff59b702fe42568a8e81e7.
> > Though it's an optimization for patterns with backreferences, it seems
> > to cause performance degradation in many cases due to regex
> > implementation issues.
> >
> > grep needs regex engine when patterns is not supported by DFA engine,
> > and when either given only matching (-o) or color option (--color) is
> > given.
> >
> > In other words, if none of them are met, grep only uses regex to check
> > the syntax. grep avoids compilation of regex not to check syntax by this
> > patch.
>
> Yikes. Thank you!
> That exposes (and fixes in this common case) a problem that makes grep
> require memory that is quadratic in the number of regular expressions.
>
> To illustrate, I ran some timings.
> With only 80,000 lines of /usr/share/dict/linux.words, the following
> would use 100GB of RSS and take 3 minutes. With the fix, it used less
> than 400MB and took less than one second.
>
> head -$N /usr/share/dict/linux.words > w; grep -vf w w
>
> N Mem(k): Old New
> 20000 6341188 (2.4s) 103168
> 40000 25241288 (9.29s) 199188 (0.31s)
> 80000 100547432 (180s) 392872 (0.66s)
>
> I've just pushed the gnulib-adjusting patch and will push the other soon.
> I'll also add a test and a NEWS item in a separate patch.
Here are the two patches (tested on top of a third that updates to
latest gnulib). I'll await an 'ok' from Norihiro Tanaka before
pushing, since commit-log metadata is essentially immutable once
pushed.
[0002-tests-test-for-many-regexp-N-2-RSS-regression.patch (application/octet-stream, attachment)]
[0001-grep-avoid-unnecessary-regex-compilation.patch (application/octet-stream, attachment)]
This bug report was last modified 4 years and 299 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.