GNU bug report logs - #21763
poor performance since grep 2.19 when comparing files with grep
Message #11 received at 21763 <at> debbugs.gnu.org:
On Mon, 26 Oct 2015 11:39:54 -0700
Jim Meyering <jim <at> meyering.net> wrote:
> On Mon, Oct 26, 2015 at 5:54 AM, Bennett, Steve
> <S.Bennett <at> lancaster.ac.uk> wrote:
> > Apologies in advance if this is more of a "discuss" question, but it looks like a particular use-case shows a marked change in performance between recent versions of grep.
> >
> > A colleague mentioned a performance issue with grep to me, and it's puzzling me a bit.
> > It turns out that he was using "grep -Fvif" to find lines in one file that are not present in another.
> >
> > Up until grep 2.18 this seems to work with linear performance and it takes less than 50ms to compare files up to about 20,000 lines.
> > With grep 2.19 and later, even relatively small files are quite slow; runtime (and memory use) grows far faster than linearly (e.g. 300ms to compare 200 lines, 1.5s for 400 lines, 5s for 600 lines).
> >
> > I've shown my colleague how to use sort and diff (and "comm", which I think is vastly underrated), but it made me wonder if this is a reasonable thing to expect grep to be able to do, and whether such a performance drop should be seen as a bug.
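> >
> > For reference, the comm-based equivalent looks roughly like this (a sketch; FILE1 and FILE2 are placeholders, and unlike "grep -Fi", comm compares whole lines case-sensitively, so case-fold first if that matters):
> >
> > # lines of FILE2 that do not appear as whole lines in FILE1
> > comm -13 <(sort FILE1) <(sort FILE2)
> > # to approximate -i, case-fold both inputs first
> > comm -13 <(tr '[:upper:]' '[:lower:]' < FILE1 | sort) <(tr '[:upper:]' '[:lower:]' < FILE2 | sort)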
> >
> > The way he was using it, he had two (unsorted) data sets (about 6000 rows in each), with most lines being common, and he was just using:
> > grep -Fvif FILE1 FILE2
> > In his case, the older version of grep took well under a second to run, but after he upgraded his machine it took 20 minutes before running out of swap and segfaulting.
> >
> > The following one-liner works to compare performance across versions (vary N to try different-sized data files):
> > N=600; F=/tmp/zz.$$; seq -f '%g bottles of beer on the wall' 1 $N > $F; time grep -Fvif $F $F; rm $F
>
> Thank you for reporting that.
> Interesting: that progression (time vs. increasing N) is clearly
> quadratic or worse when using a multibyte locale, but is linear with
> LC_ALL=C. I suspect when you run "locale", it reports something like
> en_US.utf8.
>
> I.e., if you have no need for multi-byte matching, set LC_ALL=C, and
> that idiom will be very quick, even for a million lines:
>
> $ N=1000000; F=/tmp/zz.$$; seq -f '%g bottles of beer on the wall' 1 $N > $F; LC_ALL=C env time grep -Fvif $F $F; rm $F
> 0.78user 0.08system 0:00.86elapsed 100%CPU (0avgtext+0avgdata 131536maxresident)k
> 0inputs+0outputs (0major+32587minor)pagefaults 0swaps
>
> Currently, I am not planning even to investigate this for the imminent release.
In grep 2.19 and later, and only in multibyte locales, grep -Fi is rewritten
to grep -i with an equivalent regular expression.
For example, take the case of N=2, where the pattern file contains:
--
1 bottles of beer on the wall
2 bottles of beer on the wall
--
Here "grep -Fif $F" is rewritten to
"grep -ie '1 bottles of beer on the wall\|2 bottles of beer on the wall' $F".
One might expect grep to factor out the common suffix and produce
"grep -ie '\(1\|2\) bottles of beer on the wall'", but it does not do so, and
the unfactored alternation is what causes the slowdown.