GNU bug report logs - #21763
poor performance since grep 2.19 when comparing files with grep

Package: grep;

Reported by: "Bennett, Steve" <S.Bennett <at> lancaster.ac.uk>

Date: Mon, 26 Oct 2015 14:19:03 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 21763 in the body.
You can then email your comments to 21763 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox

Report forwarded to bug-grep <at> gnu.org:
bug#21763; Package grep. (Mon, 26 Oct 2015 14:19:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to "Bennett, Steve" <S.Bennett <at> lancaster.ac.uk>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Mon, 26 Oct 2015 14:19:03 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: "Bennett, Steve" <S.Bennett <at> lancaster.ac.uk>
To: "bug-grep <at> gnu.org" <bug-grep <at> gnu.org>
Subject: poor performance since grep 2.19 when comparing files with grep
Date: Mon, 26 Oct 2015 12:54:01 +0000

Apologies in advance if this is more of a "discuss" question, but it looks like a particular use-case shows a marked change in performance between recent versions of grep.

A colleague mentioned a performance issue with grep to me, and its puzzling me a bit.
It turns out that he was using "grep -Fvif" to find lines in one file that are not present in another.

Up until grep 2.18 this seems to work with linear performance and it takes less than 50ms to compare files up to about 20,000 lines.
With grep 2.19 and later, ever relatively small files are quite slow, runtime (and memory use) increases exponentially (e.g. 300ms to compare 200 lines, 1.5s to compare 400 lines, 5s to compare 600 lines).

I've shown my colleague how to use sort and diff (and "comm", which I think is vastly underrated), but it made me wonder if this is a reasonable thing to expect grep to be able to do, and whether such a performance drop should be seen as a bug.

The way he was using it, he had two (unsorted) data sets (about 6000 rows in each), with most lines being common, and he was just using:
grep -Fvif FILE1 FILE2
In his case, the older version of grep took way less than a second to run, but after he had upgraded his machine it took 20 minutes before running out of swap and seg faulting.

In terms of comparing performance, I've found that the following works to compare performance (vary N to try different sized data files):
N=600; F=/tmp/zz.$$; seq -f '%g bottles of beer on the wall' 1 $N > $F; time grep -Fvif $F $F; rm $F

Steve.

Information forwarded to bug-grep <at> gnu.org:
bug#21763; Package grep. (Mon, 26 Oct 2015 18:41:02 GMT) Full text and rfc822 format available.

Message #8 received at 21763 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: "Bennett, Steve" <S.Bennett <at> lancaster.ac.uk>
Cc: 21763 <at> debbugs.gnu.org
Subject: Re: bug#21763: poor performance since grep 2.19 when comparing files
 with grep
Date: Mon, 26 Oct 2015 11:39:54 -0700

On Mon, Oct 26, 2015 at 5:54 AM, Bennett, Steve
<S.Bennett <at> lancaster.ac.uk> wrote:
> Apologies in advance if this is more of a "discuss" question, but it looks like a particular use-case shows a marked change in performance between recent versions of grep.
>
> A colleague mentioned a performance issue with grep to me, and its puzzling me a bit.
> It turns out that he was using "grep -Fvif" to find lines in one file that are not present in another.
>
> Up until grep 2.18 this seems to work with linear performance and it takes less than 50ms to compare files up to about 20,000 lines.
> With grep 2.19 and later, ever relatively small files are quite slow, runtime (and memory use) increases exponentially (e.g. 300ms to compare 200 lines, 1.5s to compare 400 lines, 5s to compare 600 lines).
>
> I've shown my colleague how to use sort and diff (and "comm", which I think is vastly underrated), but it made me wonder if this is a reasonable thing to expect grep to be able to do, and whether such a performance drop should be seen as a bug.
>
> The way he was using it, he had two (unsorted) data sets (about 6000 rows in each), with most lines being common, and he was just using:
>     grep -Fvif FILE1 FILE2
> In his case, the older version of grep took way less than a second to run, but after he had upgraded his machine it took 20 minutes before running out of swap and seg faulting.
>
> In terms of comparing performance, I've found that the following works to compare performance (vary N to try different sized data files):
>     N=600; F=/tmp/zz.$$; seq -f '%g bottles of beer on the wall' 1 $N > $F; time grep -Fvif $F $F; rm $F

Thank you for reporting that.
Interesting: that progression (time vs. increasing N) is clearly
quadratic or worse when using a multibyte locale, but is linear with
LC_ALL=C. I suspect when you run "locale", it reports something like
en_US.utf8.

I.e., if you have no need for multi-byte matching, set LC_ALL=C, and
that idiom will be very quick, even for a million lines:

  $ N=1000000; F=/tmp/zz.$$; seq -f '%g bottles of beer on the wall' 1
$N > $F; LC_ALL=C env time grep -Fvif $F $F; rm $F
  0.78user 0.08system 0:00.86elapsed 100%CPU (0avgtext+0avgdata
131536maxresident)k
  0inputs+0outputs (0major+32587minor)pagefaults 0swaps

Currently, I am not planning even to investigate this for the imminent release.

Information forwarded to bug-grep <at> gnu.org:
bug#21763; Package grep. (Tue, 27 Oct 2015 00:08:01 GMT) Full text and rfc822 format available.

Message #11 received at 21763 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: 21763 <at> debbugs.gnu.org
Cc: Jim Meyering <jim <at> meyering.net>, "Bennett,
 Steve" <S.Bennett <at> lancaster.ac.uk>
Subject: Re: bug#21763: poor performance since grep 2.19 when comparing files
 with grep
Date: Tue, 27 Oct 2015 09:06:51 +0900

On Mon, 26 Oct 2015 11:39:54 -0700
Jim Meyering <jim <at> meyering.net> wrote:

> On Mon, Oct 26, 2015 at 5:54 AM, Bennett, Steve
> <S.Bennett <at> lancaster.ac.uk> wrote:
> > Apologies in advance if this is more of a "discuss" question, but it looks like a particular use-case shows a marked change in performance between recent versions of grep.
> >
> > A colleague mentioned a performance issue with grep to me, and its puzzling me a bit.
> > It turns out that he was using "grep -Fvif" to find lines in one file that are not present in another.
> >
> > Up until grep 2.18 this seems to work with linear performance and it takes less than 50ms to compare files up to about 20,000 lines.
> > With grep 2.19 and later, ever relatively small files are quite slow, runtime (and memory use) increases exponentially (e.g. 300ms to compare 200 lines, 1.5s to compare 400 lines, 5s to compare 600 lines).
> >
> > I've shown my colleague how to use sort and diff (and "comm", which I think is vastly underrated), but it made me wonder if this is a reasonable thing to expect grep to be able to do, and whether such a performance drop should be seen as a bug.
> >
> > The way he was using it, he had two (unsorted) data sets (about 6000 rows in each), with most lines being common, and he was just using:
> >     grep -Fvif FILE1 FILE2
> > In his case, the older version of grep took way less than a second to run, but after he had upgraded his machine it took 20 minutes before running out of swap and seg faulting.
> >
> > In terms of comparing performance, I've found that the following works to compare performance (vary N to try different sized data files):
> >     N=600; F=/tmp/zz.$$; seq -f '%g bottles of beer on the wall' 1 $N > $F; time grep -Fvif $F $F; rm $F
> 
> Thank you for reporting that.
> Interesting: that progression (time vs. increasing N) is clearly
> quadratic or worse when using a multibyte locale, but is linear with
> LC_ALL=C. I suspect when you run "locale", it reports something like
> en_US.utf8.
> 
> I.e., if you have no need for multi-byte matching, set LC_ALL=C, and
> that idiom will be very quick, even for a million lines:
> 
>   $ N=1000000; F=/tmp/zz.$$; seq -f '%g bottles of beer on the wall' 1
> $N > $F; LC_ALL=C env time grep -Fvif $F $F; rm $F
>   0.78user 0.08system 0:00.86elapsed 100%CPU (0avgtext+0avgdata
> 131536maxresident)k
>   0inputs+0outputs (0major+32587minor)pagefaults 0swaps
> 
> Currently, I am not planning even to investigate this for the imminent release.

In grep 2.19 or later, grep -Fi is re-written to grep -i and corresponding
regular expression in only multibyte-locales.

Now, assume the case of N=2.

--
1 bottles of beer on the wall
2 bottles of beer on the wall
--

"grep -Fif $F" to "grep -e 'bottles of beer on the wall\|2 bottles of -
beer on the wall' $F"

By the way, we may expect to "grep -e '\(1\|2\) bottles of beer on the -
wall'", but grep does not do it.  It will cause slow down.

Information forwarded to bug-grep <at> gnu.org:
bug#21763; Package grep. (Tue, 27 Oct 2015 09:07:02 GMT) Full text and rfc822 format available.

Message #14 received at 21763 <at> debbugs.gnu.org (full text, mbox):

From: "Bennett, Steve" <S.Bennett <at> lancaster.ac.uk>
To: Jim Meyering <jim <at> meyering.net>
Cc: "21763 <at> debbugs.gnu.org" <21763 <at> debbugs.gnu.org>
Subject: RE: bug#21763: poor performance since grep 2.19 when comparing
 files with grep
Date: Tue, 27 Oct 2015 09:06:24 +0000

> Thank you for reporting that.
> Interesting: that progression (time vs. increasing N) is clearly quadratic
> or worse when using a multibyte locale, but is linear with LC_ALL=C.
> I suspect when you run "locale", it reports something like en_US.utf8.

Yes, it's a UTF8 locale. I hadn't thought of that and I should have realised,
given that that's what the main change was in grep v2.19.

> I.e., if you have no need for multi-byte matching, set LC_ALL=C, and that 
> idiom will be very quick, even for a million lines:

Yes, you're totally right there. Thanks!

> Currently, I am not planning even to investigate this for the imminent
> release.

I totally agree!

Cheers,

Steve.

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Wed, 21 Dec 2016 06:46:02 GMT) Full text and rfc822 format available.

Notification sent to "Bennett, Steve" <S.Bennett <at> lancaster.ac.uk>:
bug acknowledged by developer. (Wed, 21 Dec 2016 06:46:02 GMT) Full text and rfc822 format available.

Message #19 received at 21763-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: 22357-done <at> debbugs.gnu.org, 21763-done <at> debbugs.gnu.org
Cc: Norihiro Tanaka <noritnk <at> kcn.ne.jp>,
 sur-behoffski <sur_behoffski <at> grouse.com.au>, "L.A. Walsh" <law <at> tlinx.org>,
 Jim Meyering <jim <at> meyering.net>, "Bennett, Steve" <S.Bennett <at> lancaster.ac.uk>,
 JQK <jquan <at> redhat.com>, arnold <at> skeeve.com, 22239 <at> debbugs.gnu.org,
 Trevor Cordes <gnu <at> tecnopolis.ca>, Paul Jackson <pj <at> usa.net>,
 Bruno Haible <bruno <at> clisp.org>,
 Ondřej Cífka <ondra <at> cifka.com>,
 Bruce Dubbs <bruce.dubbs <at> gmail.com>
Subject: Re: bug#22357: grep -f not only huge memory usage, but also huge time
 cost
Date: Tue, 20 Dec 2016 21:17:01 -0800

[Message part 1 (text/plain, inline)]

I installed the attached patches into grep master. These fix the performance 
regressions noted at the start of Bug#22357. I see that the related performance 
problems noted in Bug#21763 seem to be fixed too, I expect because of Norihiro 
Tanaka's recent changes, so I'll boldly close both bug reports.

To some extent the attached patches restore the old behavior for grep -F, when 
grep is given two or more patterns. The patch doesn't change the underlying 
algorithms; it merely uses a different heuristic to decide whether to use the -F 
matcher. Although I wouldn't be surprised if the attached patches hurt 
performance in some cases, I didn't uncover any such cases in my performance 
testing, which I admit mostly consisted of running the examples in the 
abovementioned bug reports.

I'll leave Bug#22239 open, as I get the following performance figures 
(user+system CPU time) for the Bug#22239 benchmark, where list.txt is created by 
"aspell dump master | head -n 100000 >list.txt", and the grep commands all use 
the operands "-F -f list.txt /etc/passwd" in the en_US.utf8 locale on Fedora 24 
x86-64.

  no -i       -i       grep version
   0.25      0.33      2.16
   0.26     10.95      2.21
   0.11      2.90*     current master (including attached patches)

In the C locale, the current grep master is always significantly faster than 
grep 2.16 or 2.21 on the benchmark, so the only significant problem is the 
number marked "*". I ran the benchmarks on an AMD Phenom II X4 910e.

[0001-grep-simplify-line-counting-in-patterns.patch (text/x-diff, attachment)]

[0002-grep-simplify-matcher-configuration.patch (text/x-diff, attachment)]

[0003-grep-fix-performance-with-multiple-patterns.patch (text/x-diff, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#21763; Package grep. (Wed, 21 Dec 2016 23:24:02 GMT) Full text and rfc822 format available.

Message #22 received at 21763-done <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 21763-done <at> debbugs.gnu.org, Bruce Dubbs <bruce.dubbs <at> gmail.com>,
 22357-done <at> debbugs.gnu.org, sur-behoffski <sur_behoffski <at> grouse.com.au>,
 "L.A. Walsh" <law <at> tlinx.org>, Jim Meyering <jim <at> meyering.net>, "Bennett,
 Steve" <S.Bennett <at> lancaster.ac.uk>, JQK <jquan <at> redhat.com>, arnold <at> skeeve.com,
 22239 <at> debbugs.gnu.org, Trevor Cordes <gnu <at> tecnopolis.ca>,
 Paul Jackson <pj <at> usa.net>, Bruno Haible <bruno <at> clisp.org>,
 Ond?ej Cifka <ondra <at> cifka.com>
Subject: Re: bug#22357: grep -f not only huge memory usage,
 but also huge time cost
Date: Thu, 22 Dec 2016 08:04:42 +0900

On Tue, 20 Dec 2016 21:17:01 -0800
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> I installed the attached patches into grep master. These fix the
> performance regressions noted at the start of Bug#22357. I see that
> the related performance problems noted in Bug#21763 seem to be fixed
> too, I expect because of Norihiro Tanaka's recent changes, so I'll
> boldly close both bug reports.
> 
> To some extent the attached patches restore the old behavior for grep
> -F, when grep is given two or more patterns. The patch doesn't change
> the underlying algorithms; it merely uses a different heuristic to
> decide whether to use the -F matcher. Although I wouldn't be
> surprised if the attached patches hurt performance in some cases, I
> didn't uncover any such cases in my performance testing, which I
> admit mostly consisted of running the examples in the abovementioned
> bug reports.
> 
> I'll leave Bug#22239 open, as I get the following performance figures
> (user+system CPU time) for the Bug#22239 benchmark, where list.txt is
> created by "aspell dump master | head -n 100000 >list.txt", and the
> grep commands all use the operands "-F -f list.txt /etc/passwd" in
> the en_US.utf8 locale on Fedora 24 x86-64.
> 
>    no -i       -i       grep version
>     0.25      0.33      2.16
>     0.26     10.95      2.21
>     0.11      2.90*     current master (including attached patches)
> 
> In the C locale, the current grep master is always significantly
> faster than grep 2.16 or 2.21 on the benchmark, so the only
> significant problem is the number marked "*". I ran the benchmarks on
> an AMD Phenom II X4 910e.

Thanks.

BTW, are you aware of extreme slowdown in the following cases after
third patch?

  yes $(printf %040d 0) | head -10000000 >inp
  printf '0\n1\n' >pat
  env LC_ALL=C src/grep -w -f pat inp

Information forwarded to bug-grep <at> gnu.org:
bug#21763; Package grep. (Thu, 22 Dec 2016 17:11:03 GMT) Full text and rfc822 format available.

Message #25 received at 21763-done <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 21763-done <at> debbugs.gnu.org,
 Bruce Dubbs <bruce.dubbs <at> gmail.com>, 22357-done <at> debbugs.gnu.org,
 sur-behoffski <sur_behoffski <at> grouse.com.au>, "L.A. Walsh" <law <at> tlinx.org>,
 "Bennett, Steve" <S.Bennett <at> lancaster.ac.uk>, JQK <jquan <at> redhat.com>,
 Aharon Robbins <arnold <at> skeeve.com>, 22239 <at> debbugs.gnu.org,
 Trevor Cordes <gnu <at> tecnopolis.ca>, Paul Jackson <pj <at> usa.net>,
 Bruno Haible <bruno <at> clisp.org>, Ond?ej Cifka <ondra <at> cifka.com>
Subject: Re: bug#22357: grep -f not only huge memory usage,
 but also huge time cost
Date: Thu, 22 Dec 2016 03:14:16 -0800

On Wed, Dec 21, 2016 at 3:04 PM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
>   yes $(printf %040d 0) | head -10000000 >inp
>   printf '0\n1\n' >pat
>   env LC_ALL=C src/grep -w -f pat inp

Thanks for mentioning that.
Here is some actual data (Fedora 25 on an i7-4770S):

for i in $(env ls -1dv /p/p/grep-*/bin/grep) src/grep; do env LC_ALL=C
time -f "%e $i" $i -w -f pat inp; done
1.05 /p/p/grep-2.3/bin/grep
0.91 /p/p/grep-2.4.1/bin/grep
0.91 /p/p/grep-2.4.2/bin/grep
0.91 /p/p/grep-2.4/bin/grep
0.94 /p/p/grep-2.5.1/bin/grep
0.94 /p/p/grep-2.5.3/bin/grep
0.94 /p/p/grep-2.5.4/bin/grep
0.95 /p/p/grep-2.5/bin/grep
0.90 /p/p/grep-2.6.1/bin/grep
1.03 /p/p/grep-2.6.2/bin/grep
0.90 /p/p/grep-2.6.3/bin/grep
0.90 /p/p/grep-2.6/bin/grep
0.90 /p/p/grep-2.7/bin/grep
0.90 /p/p/grep-2.8/bin/grep
0.90 /p/p/grep-2.9/bin/grep
0.90 /p/p/grep-2.10/bin/grep
0.89 /p/p/grep-2.11/bin/grep
0.90 /p/p/grep-2.12/bin/grep
0.90 /p/p/grep-2.13/bin/grep
0.89 /p/p/grep-2.14/bin/grep
0.90 /p/p/grep-2.15/bin/grep
0.90 /p/p/grep-2.16/bin/grep
0.89 /p/p/grep-2.17/bin/grep
0.90 /p/p/grep-2.18/bin/grep
0.90 /p/p/grep-2.19/bin/grep
0.90 /p/p/grep-2.20/bin/grep
0.86 /p/p/grep-2.21/bin/grep
0.90 /p/p/grep-2.22/bin/grep
0.91 /p/p/grep-2.23/bin/grep
0.91 /p/p/grep-2.24/bin/grep
0.91 /p/p/grep-2.25/bin/grep
0.25 /p/p/grep-2.26/bin/grep
13.07 /p/p/grep-2.27/bin/grep
37.49 src/grep

Information forwarded to bug-grep <at> gnu.org:
bug#21763; Package grep. (Sat, 24 Dec 2016 01:39:02 GMT) Full text and rfc822 format available.

Message #28 received at 21763-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 21763-done <at> debbugs.gnu.org, 22357-done <at> debbugs.gnu.org,
 sur-behoffski <sur_behoffski <at> grouse.com.au>, "L.A. Walsh" <law <at> tlinx.org>,
 JQK <jquan <at> redhat.com>, 22239 <at> debbugs.gnu.org,
 Trevor Cordes <gnu <at> tecnopolis.ca>, Bruno Haible <bruno <at> clisp.org>,
 Ondřej Cífka <ondra <at> cifka.com>
Subject: Re: bug#22239: bug#22357: grep -f not only huge memory usage, but
 also huge time cost
Date: Fri, 23 Dec 2016 17:38:42 -0800

[Message part 1 (text/plain, inline)]

Norihiro Tanaka wrote:
> are you aware of extreme slowdown in the following cases after third patch?
>
>   yes $(printf %040d 0) | head -10000000 >inp
>   printf '0\n1\n' >pat
>   env LC_ALL=C src/grep -w -f pat inp

No. Thanks, I hadn't considered that possibility. I looked into the slowdown and 
installed the attached patches, which cause 'grep' to run about as fast on this 
test case as grep 2.25 (though not as fast as grep 2.26). The main fix is in 
patch 5. On my platform:

  -------grep version------
   v2.25  v2.26  v2.27 master     locale      command
    1.21   0.69  24.95   1.22     C           grep -w -f pat inp
  207.36 203.15 202.03   1.22     en_US.utf8  grep -w -f pat inp
    1.21   0.69  25.95   0.85     C           grep -w -f pat inp -F
   66.33  68.07  67.21   1.22     en_US.utf8  grep -w -f pat inp -F

All numbers are user+system CPU seconds on Fedora 24 x86-64 (AMD Phenom II X4 
910e). "master" means after the attached patches are installed.

Perhaps we can fiddle with the heuristics a bit so that v2.26 is not 
significantly faster than the master in the C locale.

[0001-maint-rewrite-to-avoid-some-macros.patch (text/x-diff, attachment)]

[0002-grep-remove-C-label.patch (text/x-diff, attachment)]

[0003-grep-simplify-Fexecute.patch (text/x-diff, attachment)]

[0004-grep-specialize-word-finding-functions.patch (text/x-diff, attachment)]

[0005-grep-speed-up-wf-in-C-locale.patch (text/x-diff, attachment)]

[0006-grep-standardize-on-localeinfo.multibyte.patch (text/x-diff, attachment)]

[0007-grep-improve-word-checking-with-UTF-8.patch (text/x-diff, attachment)]

[0008-grep-fix-comment-in-searchutils.c.patch (text/x-diff, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#21763; Package grep. (Mon, 26 Dec 2016 12:08:02 GMT) Full text and rfc822 format available.

Message #31 received at 21763-done <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 21763-done <at> debbugs.gnu.org, 22357-done <at> debbugs.gnu.org,
 sur-behoffski <sur_behoffski <at> grouse.com.au>, "L.A. Walsh" <law <at> tlinx.org>,
 JQK <jquan <at> redhat.com>, 22239 <at> debbugs.gnu.org,
 Trevor Cordes <gnu <at> tecnopolis.ca>, Bruno Haible <bruno <at> clisp.org>,
 Ond?ej Cifka <ondra <at> cifka.com>
Subject: Re: bug#22239: bug#22357: grep -f not only huge memory usage,
 but also huge time cost
Date: Mon, 26 Dec 2016 21:06:55 +0900

On Fri, 23 Dec 2016 17:38:42 -0800
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> No. Thanks, I hadn't considered that possibility. I looked into the
> slowdown and installed the attached patches, which cause 'grep' to
> run about as fast on this test case as grep 2.25 (though not as fast
> as grep 2.26). The main fix is in patch 5. On my platform:

Hmm, how about the following test cases, although it is extreame?

$ cat pat
0
00 0
00 00 0
00 00 00 0
00 00 00 00 0
00 00 00 00 00 0
00 00 00 00 00 00 0
00 00 00 00 00 00 00 0
00 00 00 00 00 00 00 00 0
00 00 00 00 00 00 00 00 00 0
00 00 00 00 00 00 00 00 00 00 0
00 00 00 00 00 00 00 00 00 00 00 0
00 00 00 00 00 00 00 00 00 00 00 00 0
00 00 00 00 00 00 00 00 00 00 00 00 00 0
$ yes '00 00 00 00 00 00 00 00 00 00 00 00 00' |
  head -1000000 >inp
$ time -p env LC_ALL=C src/grep -w -f pat inp

Information forwarded to bug-grep <at> gnu.org:
bug#21763; Package grep. (Mon, 26 Dec 2016 13:02:02 GMT) Full text and rfc822 format available.

Message #34 received at 21763-done <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 21763-done <at> debbugs.gnu.org,
 22357-done <at> debbugs.gnu.org, sur-behoffski <sur_behoffski <at> grouse.com.au>,
 "L.A. Walsh" <law <at> tlinx.org>, JQK <jquan <at> redhat.com>, 22239 <at> debbugs.gnu.org,
 Trevor Cordes <gnu <at> tecnopolis.ca>, Bruno Haible <bruno <at> clisp.org>,
 Ond?ej Cifka <ondra <at> cifka.com>
Subject: Re: bug#22239: bug#22357: grep -f not only huge memory usage, but
 also huge time cost
Date: Mon, 26 Dec 2016 14:01:16 +0100

On Mon, Dec 26, 2016 at 1:06 PM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
>
> On Fri, 23 Dec 2016 17:38:42 -0800
> Paul Eggert <eggert <at> cs.ucla.edu> wrote:
>
>> No. Thanks, I hadn't considered that possibility. I looked into the
>> slowdown and installed the attached patches, which cause 'grep' to
>> run about as fast on this test case as grep 2.25 (though not as fast
>> as grep 2.26). The main fix is in patch 5. On my platform:
>
> Hmm, how about the following test cases, although it is extreame?
>
> $ cat pat
> 0
> 00 0
> 00 00 0
> 00 00 00 0
> 00 00 00 00 0
> 00 00 00 00 00 0
> 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 00 00 00 00 00 00 0
> $ yes '00 00 00 00 00 00 00 00 00 00 00 00 00' |
>   head -1000000 >inp
> $ time -p env LC_ALL=C src/grep -w -f pat inp

That took 7 seconds for me.

Here is one that is O(N^2) in the length of the literal search string:
(the search strings are sequences of '0's, with lengths from 10k.. 320k):

$ n=10000; for i in $(seq 6); do env time -f "$n %e" src/grep -f
<(printf %0${n}d 0) <<<1; ((n *= 2)); done
10000 0.44
20000 1.44
40000 5.41
80000 21.27
160000 97.27
320000 426.72

Information forwarded to bug-grep <at> gnu.org:
bug#21763; Package grep. (Mon, 26 Dec 2016 20:08:03 GMT) Full text and rfc822 format available.

Message #37 received at 21763-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 21763-done <at> debbugs.gnu.org, 22357-done <at> debbugs.gnu.org,
 sur-behoffski <sur_behoffski <at> grouse.com.au>, "L.A. Walsh" <law <at> tlinx.org>,
 JQK <jquan <at> redhat.com>, 22239 <at> debbugs.gnu.org,
 Trevor Cordes <gnu <at> tecnopolis.ca>, Bruno Haible <bruno <at> clisp.org>,
 Ond?ej Cifka <ondra <at> cifka.com>
Subject: Re: bug#21763: bug#22239: bug#22357: grep -f not only huge memory
 usage, but also huge time cost
Date: Mon, 26 Dec 2016 12:07:49 -0800

[Message part 1 (text/plain, inline)]

Norihiro Tanaka wrote:
> Hmm, how about the following test cases, although it is extreame?

I don't think we need to worry about performance for the case when -w is given, 
and a pattern matches data that contains non-word characters. In practice, such 
cases are rare. I expect that most users would be surprised that -w can match 
non-word characters, and that users wouldn't object to -w rejecting such matches 
(if this wouldn't hurt performance significantly).

While looking into this I did find a very small performance tweak for the test 
case, and installed the attached.

[0001-grep-minor-performance-tweak-for-pure-functions.patch (text/x-diff, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#21763; Package grep. (Wed, 28 Dec 2016 00:22:02 GMT) Full text and rfc822 format available.

Message #40 received at 21763-done <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 21763-done <at> debbugs.gnu.org, 22357-done <at> debbugs.gnu.org,
 sur-behoffski <sur_behoffski <at> grouse.com.au>, "L.A. Walsh" <law <at> tlinx.org>,
 JQK <jquan <at> redhat.com>, 22239 <at> debbugs.gnu.org,
 Trevor Cordes <gnu <at> tecnopolis.ca>, Bruno Haible <bruno <at> clisp.org>,
 Ond?ej Cifka <ondra <at> cifka.com>
Subject: Re: bug#21763: bug#22239: bug#22357: grep -f not only huge memory
 usage, but also huge time cost
Date: Wed, 28 Dec 2016 09:21:02 +0900

[Message part 1 (text/plain, inline)]

On Mon, 26 Dec 2016 12:07:49 -0800
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> Norihiro Tanaka wrote:
> > Hmm, how about the following test cases, although it is extreame?
> 
> I don't think we need to worry about performance for the case when -w
> is given, and a pattern matches data that contains non-word
> characters. In practice, such cases are rare. I expect that most
> users would be surprised that -w can match non-word characters, and
> that users wouldn't object to -w rejecting such matches (if this
> wouldn't hurt performance significantly).
> 
> While looking into this I did find a very small performance tweak for
> the test case, and installed the attached.

Thanks.

BTW, with multiple patterns in current master, former uses fgrep matcher,
and later users grep matcher.  I think that it is not reasonable.

  env LC_ALL=C grep -w -f pat inp

  env LC_ALL=C grep -F -w -f pat inp

So I wrote the patch to use fgrep matcher for both.

[0001-grep-imorove-performance-with-multiple-patterns.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#21763; Package grep. (Wed, 28 Dec 2016 06:38:02 GMT) Full text and rfc822 format available.

Message #43 received at 21763-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 21763-done <at> debbugs.gnu.org, 22357-done <at> debbugs.gnu.org,
 sur-behoffski <sur_behoffski <at> grouse.com.au>, "L.A. Walsh" <law <at> tlinx.org>,
 JQK <jquan <at> redhat.com>, 22239 <at> debbugs.gnu.org,
 Trevor Cordes <gnu <at> tecnopolis.ca>, Bruno Haible <bruno <at> clisp.org>,
 Ond?ej Cifka <ondra <at> cifka.com>
Subject: Re: bug#21763: bug#22239: bug#22357: grep -f not only huge memory
 usage, but also huge time cost
Date: Tue, 27 Dec 2016 22:37:25 -0800

Norihiro Tanaka wrote:
> So I wrote the patch to use fgrep matcher for both.

Thanks, I installed that after tweaking the commit message and omitting 
unnecessary parens.

Information forwarded to bug-grep <at> gnu.org:
bug#21763; Package grep. (Wed, 28 Dec 2016 14:34:02 GMT) Full text and rfc822 format available.

Message #46 received at 21763-done <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 21763-done <at> debbugs.gnu.org, 22357-done <at> debbugs.gnu.org,
 sur-behoffski <sur_behoffski <at> grouse.com.au>, "L.A. Walsh" <law <at> tlinx.org>,
 JQK <jquan <at> redhat.com>, 22239 <at> debbugs.gnu.org,
 Trevor Cordes <gnu <at> tecnopolis.ca>, Bruno Haible <bruno <at> clisp.org>,
 Ond?ej Cifka <ondra <at> cifka.com>
Subject: Re: bug#21763: bug#22239: bug#22357: grep -f not only huge memory
 usage, but also huge time cost
Date: Wed, 28 Dec 2016 23:33:34 +0900

On Tue, 27 Dec 2016 22:37:25 -0800
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> Norihiro Tanaka wrote:
> > So I wrote the patch to use fgrep matcher for both.
> 
> Thanks, I installed that after tweaking the commit message and omitting unnecessary parens.

Thanks, I confirmed it.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 26 Jan 2017 12:24:03 GMT) Full text and rfc822 format available.

This bug report was last modified 8 years and 198 days ago.

Previous Next

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #21763 poor performance since grep 2.19 when comparing files with grep

GNU bug report logs - #21763
poor performance since grep 2.19 when comparing files with grep