GNU bug report logs -
#43862
[PATCH] grep: set RE_NO_SUB for calling regex only to check syntax
Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Date: Thu, 8 Oct 2020 09:41:01 UTC
Severity: normal
Tags: patch
Done: Jim Meyering <jim <at> meyering.net>
Bug is archived. No further changes may be made.
[Message part 1 (text/plain, inline)]
Your bug report
#43862: [PATCH] grep: set RE_NO_SUB for calling regex only to check syntax
which was filed against the grep package, has been closed.
The explanation is attached below, along with your original report.
If you require more details, please reply to 43862 <at> debbugs.gnu.org.
--
43862: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=43862
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
On Mon, Oct 12, 2020 at 4:08 PM Jim Meyering <jim <at> meyering.net> wrote:
> On Thu, Oct 8, 2020 at 2:41 AM Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> >
> > We can set RE_NO_SUB when calling regex only to check syntax. This
> > yields a performance gain for patterns that produce a very large
> > number of epsilon nodes.
> >
> >
> > $ printf '(%020000d)\n' | sed 's/0/|/g' >pat
> >
> > (before)
> > $ time -p env LC_ALL=C src/grep -Ef pat /dev/null
> > real 6.15
> > user 4.62
> > sys 1.52
> >
> > (after)
> > $ time -p env LC_ALL=C src/grep -Ef pat /dev/null
> > real 0.66
> > user 0.19
> > sys 0.46
>
> Thank you.
>
> FYI, when running similar commands with and without your patch (with
> an eye to adding a test), I ran this one (with your patch). It shows
> that using 80,000 terms caused grep to consume 32GB of memory before
> being OOM-killed:
>
> $ printf '(%080000d)\n' | sed 's/0/|/g' | env time src/grep -Ef- /dev/null
> Command terminated by signal 9
> 6.42user 19.98system 0:57.91elapsed 45%CPU (0avgtext+0avgdata 32024460maxresident)k
> 6504inputs+0outputs (92major+12003644minor)pagefaults 0swaps
> [Exit 137 (KILL)]
>
> I will come back to this later this week.

We must accept the fact that extreme regular expressions will cause
resource exhaustion like that when processed by classical regex_*
functions. This is yet another good reason to prefer PCRE and to use
grep's -P option. In that case, it fails like this:

$ printf '(%080000d)\n' | sed 's/0/|/g' | grep -Pf- /dev/null
grep: regular expression is too large

I have just pushed your patch, but without adding a test.
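
The commands in this thread run grep against /dev/null, so only pattern compilation is exercised. A minimal sketch of turning that into a pass/fail check follows. It relies on grep's documented exit statuses (0 for a match, 1 for no match, 2 for an error); the helper name pattern_status is hypothetical and is not part of grep or of the patch.

```shell
# Hypothetical helper: report whether grep can compile a pattern,
# without searching any real data.
#   $1 = matcher option (for example -E or -P)
#   $2 = pattern text
pattern_status() {
    # Feed the pattern on stdin (-f -) and search the empty /dev/null.
    printf '%s\n' "$2" | grep "$1" -f - /dev/null >/dev/null 2>&1
    case $? in
        0|1) echo ok ;;      # compiled; 1 only means "no match" on empty input
        *)   echo failed ;;  # 2 is a grep error; larger values include signals
    esac
}
```

For example, `pattern_status -E 'a|b'` reports ok, while `pattern_status -E '('` reports failed because of the unmatched parenthesis. A process killed by the OOM killer, as in the 80,000-term run, would also land in the failed branch.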
[Message part 3 (message/rfc822, inline)]
[Message part 4 (text/plain, inline)]
We can set RE_NO_SUB when calling regex only to check syntax. This
yields a performance gain for patterns that produce a very large
number of epsilon nodes.
$ printf '(%020000d)\n' | sed 's/0/|/g' >pat
(before)
$ time -p env LC_ALL=C src/grep -Ef pat /dev/null
real 6.15
user 4.62
sys 1.52
(after)
$ time -p env LC_ALL=C src/grep -Ef pat /dev/null
real 0.66
user 0.19
sys 0.46
[0001-grep-set-RE_NO_SUB-for-calling-regex-only-to-check-s.patch (text/plain, attachment)]
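
The pattern file in the report is built by printing a zero padded to 20000 digits inside parentheses and then replacing every 0 with |, which gives one group containing 20000 alternation bars between empty alternatives. The sketch below reproduces that construction at an arbitrary size so it can be inspected; the function name make_alt_pattern is hypothetical, and its argument is assumed to be a positive integer.

```shell
# Hypothetical helper: build "(" followed by N "|" characters and ")".
#   $1 = number of "|" characters (assumed to be a positive integer)
make_alt_pattern() {
    # With no argument supplied for %d, printf uses 0, so the zero-padded
    # field is a run of N zeros; sed then turns each zero into "|".
    printf "(%0${1}d)\n" | sed 's/0/|/g'
}

make_alt_pattern 5    # prints (|||||)
```

Running `make_alt_pattern 20000 >pat` should produce the same file as the command in the report.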
This bug report was last modified 4 years and 205 days ago.