GNU bug report logs -
#44754
Extreme performance degradation in GNU grep 3.4 / 3.6
Previous Next
Full log
View this message in rfc822 format
[Message part 1 (text/plain, inline)]
On Thu, Nov 19, 2020 at 7:32 PM Frank Heckenbach
<f.heckenbach <at> fh-soft.de> wrote:
> I have a use case where I run grep with a large number of search
> patterns on a large text file. It works well with grep-3.3, but with
> grep-3.4 it quickly burned through GBs of memory and almost locked
> up my system due to swapping.
>
> To avoid attaching those large files, I could mostly reproduce the
> effects like this:
>
> ulimit -d 5000000 # avoid system lockup due to excessive swapping
> export LC_ALL=C # make sure no Unicode case conversions are needed
>
> % time ./grep-3.3 -Fwif <(seq 300000 | tr 0-9 A-J) <<<foo
>
> real 0m0.054s
> user 0m0.048s
> sys 0m0.012s
>
> % time ./grep-3.4 -Fwif <(seq 30000 | tr 0-9 A-J) <<<foo
> ./grep-3.4: Memory exhausted
> Aborted
>
> real 0m1.291s
> user 0m0.696s
> sys 0m0.599s
>
> % time ./grep-3.6 -Fwif <(seq 300000 | tr 0-9 A-J) <<<foo
>
> real 0m13.162s
> user 0m12.955s
> sys 0m0.211s
>
> grep-3.3 behaves well, even with much larger number of patterns.
> Time seems to grow linearly, and memory usage is constant.
>
> grep-3.4 behaves the worst of these 3 versions. Even with just 30000
> patterns it exceeds the ulimit of 5 GB.
>
> grep-3.6 behaves a bit better than 3.4, but still bad. Time seems to
> be quadratic in the number of patterns, and though memory usage in
> this case seems to be almost constant, in my actual use case it also
> runs out of memory where grep-3.3 works well with just a few 100 MB
> used.
>
> Without "-i", grep-3.4 seems to run as fast as grep-3.3, but
> grep-3.6 is almost as slow as with "-i".
>
> So there might actually be two different issues here, one that
> affects 3.4 with "-i" and one that affects 3.6 with or without "-i".
Thank you for the fine bug report.
The grep-3.6 bug you've exposed is due to the fact that your input
triggers excessive hash collisions when using the code modeled after
gnulib/lib/hash-pjw.c. That made the new pattern-preprocessing phase
take O(N^2) time for N patterns. In the attached, I've switched grep
to use the djb2 hash function, and that resolves the problem. I'll
also add a NEWS entry and a test before pushing this.
[0001-grep-avoid-performance-regression-with-many-patterns.patch (application/octet-stream, attachment)]
This bug report was last modified 4 years and 225 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.