GNU bug report logs - #21763
poor performance since grep 2.19 when comparing files with grep

Previous Next

Package: grep;

Reported by: "Bennett, Steve" <S.Bennett <at> lancaster.ac.uk>

Date: Mon, 26 Oct 2015 14:19:03 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Full log


View this message in rfc822 format

From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: "Bennett, Steve" <S.Bennett <at> lancaster.ac.uk>
Subject: bug#21763: closed (Re: bug#22357: grep -f not only huge memory
 usage, but also huge time cost)
Date: Wed, 21 Dec 2016 06:46:02 +0000
[Message part 1 (text/plain, inline)]
Your bug report

#21763: poor performance since grep 2.19 when comparing files with grep

which was filed against the grep package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 21763 <at> debbugs.gnu.org.

-- 
21763: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=21763
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
From: Paul Eggert <eggert <at> cs.ucla.edu>
To: 22357-done <at> debbugs.gnu.org, 21763-done <at> debbugs.gnu.org
Cc: Norihiro Tanaka <noritnk <at> kcn.ne.jp>,
 sur-behoffski <sur_behoffski <at> grouse.com.au>, "L.A. Walsh" <law <at> tlinx.org>,
 Jim Meyering <jim <at> meyering.net>, "Bennett, Steve" <S.Bennett <at> lancaster.ac.uk>,
 JQK <jquan <at> redhat.com>, arnold <at> skeeve.com, 22239 <at> debbugs.gnu.org,
 Trevor Cordes <gnu <at> tecnopolis.ca>, Paul Jackson <pj <at> usa.net>,
 Bruno Haible <bruno <at> clisp.org>,
 Ondřej Cífka <ondra <at> cifka.com>,
 Bruce Dubbs <bruce.dubbs <at> gmail.com>
Subject: Re: bug#22357: grep -f not only huge memory usage, but also huge time
 cost
Date: Tue, 20 Dec 2016 21:17:01 -0800
[Message part 3 (text/plain, inline)]
I installed the attached patches into grep master. These fix the performance 
regressions noted at the start of Bug#22357. I see that the related performance 
problems noted in Bug#21763 seem to be fixed too, I expect because of Norihiro 
Tanaka's recent changes, so I'll boldly close both bug reports.

To some extent the attached patches restore the old behavior for grep -F, when 
grep is given two or more patterns. The patch doesn't change the underlying 
algorithms; it merely uses a different heuristic to decide whether to use the -F 
matcher. Although I wouldn't be surprised if the attached patches hurt 
performance in some cases, I didn't uncover any such cases in my performance 
testing, which I admit mostly consisted of running the examples in the 
abovementioned bug reports.

I'll leave Bug#22239 open, as I get the following performance figures 
(user+system CPU time) for the Bug#22239 benchmark, where list.txt is created by 
"aspell dump master | head -n 100000 >list.txt", and the grep commands all use 
the operands "-F -f list.txt /etc/passwd" in the en_US.utf8 locale on Fedora 24 
x86-64.

  no -i       -i       grep version
   0.25      0.33      2.16
   0.26     10.95      2.21
   0.11      2.90*     current master (including attached patches)

In the C locale, the current grep master is always significantly faster than 
grep 2.16 or 2.21 on the benchmark, so the only significant problem is the 
number marked "*". I ran the benchmarks on an AMD Phenom II X4 910e.
[0001-grep-simplify-line-counting-in-patterns.patch (text/x-diff, attachment)]
[0002-grep-simplify-matcher-configuration.patch (text/x-diff, attachment)]
[0003-grep-fix-performance-with-multiple-patterns.patch (text/x-diff, attachment)]
[Message part 7 (message/rfc822, inline)]
From: "Bennett, Steve" <S.Bennett <at> lancaster.ac.uk>
To: "bug-grep <at> gnu.org" <bug-grep <at> gnu.org>
Subject: poor performance since grep 2.19 when comparing files with grep
Date: Mon, 26 Oct 2015 12:54:01 +0000
Apologies in advance if this is more of a "discuss" question, but it looks like a particular use-case shows a marked change in performance between recent versions of grep.

A colleague mentioned a performance issue with grep to me, and its puzzling me a bit.
It turns out that he was using "grep -Fvif" to find lines in one file that are not present in another.

Up until grep 2.18 this seems to work with linear performance and it takes less than 50ms to compare files up to about 20,000 lines.
With grep 2.19 and later, ever relatively small files are quite slow, runtime (and memory use) increases exponentially (e.g. 300ms to compare 200 lines, 1.5s to compare 400 lines, 5s to compare 600 lines).

I've shown my colleague how to use sort and diff (and "comm", which I think is vastly underrated), but it made me wonder if this is a reasonable thing to expect grep to be able to do, and whether such a performance drop should be seen as a bug.

The way he was using it, he had two (unsorted) data sets (about 6000 rows in each), with most lines being common, and he was just using:
    grep -Fvif FILE1 FILE2
In his case, the older version of grep took way less than a second to run, but after he had upgraded his machine it took 20 minutes before running out of swap and seg faulting. 

In terms of comparing performance, I've found that the following works to compare performance (vary N to try different sized data files):
    N=600; F=/tmp/zz.$$; seq -f '%g bottles of beer on the wall' 1 $N > $F; time grep -Fvif $F $F; rm $F

Steve.



This bug report was last modified 8 years and 148 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.