GNU bug report logs - #31074
Grep -i is slow


Package: grep

Reported by: Geoff Kuenning <geoff <at> cs.hmc.edu>

Date: Fri, 6 Apr 2018 05:33:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.


Report forwarded to bug-grep <at> gnu.org:
bug#31074; Package grep. (Fri, 06 Apr 2018 05:33:02 GMT)

Acknowledgement sent to Geoff Kuenning <geoff <at> cs.hmc.edu>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Fri, 06 Apr 2018 05:33:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Geoff Kuenning <geoff <at> cs.hmc.edu>
To: bug-grep <at> gnu.org
Subject: Grep -i is slow
Date: Thu, 05 Apr 2018 22:32:41 -0700

The -i switch is slow when searching large files.  I haven't dug into
the code in detail, although it seems that dfa.c is trying to build an
intelligent case-agnostic DFA when -i is specified.  But that doesn't
seem to be working.  Perhaps that's because I'm running the UTF-8
character set?  Although I don't see why that would affect the DFA.
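
One way to test the locale hypothesis is to run the same case-insensitive search under the C locale and under a UTF-8 locale and compare timings. The following is only a sketch: `grep_ci_in_locale` is a hypothetical helper name, the sample file is made up for illustration, and `en_US.UTF-8` is assumed to be installed on the system.

```shell
# Hypothetical helper: count case-insensitive matches under a given locale.
#   $1 = locale name (for example C or en_US.UTF-8)
#   $2 = pattern
#   $3 = file to search
grep_ci_in_locale() {
    LC_ALL=$1 grep -i -c -- "$2" "$3"
}

# Small demonstration on a throwaway file (not the reporter's data).
sample=$(mktemp)
printf 'Outgoing Harris DCRAW\nunrelated line\n' > "$sample"
grep_ci_in_locale C 'outgoing.*harris.*dcraw' "$sample"
rm -f "$sample"

# On the real file, prefix each call with the shell's time facility, e.g.:
#   time grep_ci_in_locale C           'outgoing.*harris.*dcraw' rawindex
#   time grep_ci_in_locale en_US.UTF-8 'outgoing.*harris.*dcraw' rawindex
```

If the C-locale run is much faster than the UTF-8 run, the slowdown is tied to multibyte handling rather than to -i alone.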

Here's an example of timing several greps of a 151M file named "rawindex",
which has already been read so that it is in the file system buffer cache.
In each case the grep finds a single match, since the matched line is
actually all lowercase; for privacy, I have omitted the match lines
themselves.

A straightforward match takes only 199 ms even with two .* patterns.
Adding -i blows that up to 6917 ms.  Finally, when I write an explicit
case-agnostic pattern to force how the DFA is built, it does run slower
(532 ms), but it's nowhere near the -i time.
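
Writing the explicit case-agnostic pattern by hand is tedious, so it can be generated. This is a minimal sketch, not part of grep: `ci_pattern` is a hypothetical helper, and it assumes the input contains only plain letters and simple operators such as `.` and `*` (it would mangle patterns that already contain bracket expressions or backslash escapes).

```shell
# Hypothetical helper: rewrite each letter X of a pattern as the
# bracket expression [Xx], leaving all other characters untouched.
ci_pattern() {
    printf '%s\n' "$1" | awk '{
        out = ""
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            # A character whose upper and lower forms differ is a letter.
            if (toupper(c) != tolower(c))
                out = out "[" toupper(c) tolower(c) "]"
            else
                out = out c
        }
        print out
    }'
}

# Prints the expanded form of the pattern used in this report.
ci_pattern 'outgoing.*harris.*dcraw'
```

The result can be passed straight to grep, for example `grep "$(ci_pattern 'outgoing.*harris.*dcraw')" rawindex`.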

mallet:514> time grep outgoing.*harris.*dcraw rawindex 

real    0m0.199s
user    0m0.170s
sys     0m0.029s

mallet:515> time grep -i outgoing.*harris.*dcraw rawindex 

real    0m6.917s
user    0m6.879s
sys     0m0.036s

mallet:516> time grep '[Oo][Uu][Tt][Gg][Oo][Ii][Nn][Gg].*[Hh][Aa][Rr][Rr][Ii][Ss].*[Dd][Cc][Rr][Aa][Ww]' rawindex

real    0m0.532s
user    0m0.491s
sys     0m0.040s
-- 
    Geoff Kuenning   geoff <at> cs.hmc.edu   http://www.cs.hmc.edu/~geoff/

The DMCA criminalizes curiosity.  It would put Susie in jail for
taking her stereo apart to see how it works.




Information forwarded to bug-grep <at> gnu.org:
bug#31074; Package grep. (Fri, 06 Apr 2018 19:36:01 GMT)

Message #8 received at 31074 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Geoff Kuenning <geoff <at> cs.hmc.edu>, 31074 <at> debbugs.gnu.org
Subject: Re: bug#31074: Grep -i is slow
Date: Fri, 6 Apr 2018 12:35:30 -0700

It sounds like you've run into a bug that was fixed in grep 2.18
(2014-02-20). Please try grep 3.1, the current version. If that doesn't 
work, it'd be helpful if you could give us a reproducible test case. 
Here's how I tried (and failed) to reproduce the problem on Fedora 27 
x86-64, which has grep 3.1:

$ shuf -i 1-20000000 >rawindex
$ ls -l rawindex
-rw-r--r--. 1 eggert eggert 168888897 Apr  6 12:30 rawindex
$ time grep outgoing.*harris.*dcraw rawindex

real    0m0.069s
user    0m0.013s
sys     0m0.055s
$ time grep -i outgoing.*harris.*dcraw rawindex

real    0m0.418s
user    0m0.368s
sys     0m0.048s
$ time grep '[Oo][Uu][Tt][Gg][Oo][Ii][Nn][Gg].*[Hh][Aa][Rr][Rr][Ii][Ss].*[Dd][Cc][Rr][Aa][Ww]' rawindex

real    0m0.416s
user    0m0.357s
sys     0m0.058s
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
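
Since the fix mentioned above landed in grep 2.18, a quick first check is whether the installed grep is older than that. The sketch below rests on two assumptions: GNU grep, whose `--version` output ends its first line with the version number, and GNU `sort -V` for version ordering. `version_lt` is a hypothetical helper name.

```shell
# version_lt A B: succeed if version A sorts strictly before version B.
version_lt() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Assumption: the first line looks like "grep (GNU grep) 3.1",
# so the last field is the version number.
installed=$(grep --version 2>/dev/null | head -n 1 | awk '{print $NF}')

if [ -z "$installed" ]; then
    echo "could not determine the grep version"
elif version_lt "$installed" 2.18; then
    echo "grep $installed predates 2.18; the -i slowdown may apply"
else
    echo "grep $installed is 2.18 or newer"
fi
```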





Information forwarded to bug-grep <at> gnu.org:
bug#31074; Package grep. (Mon, 09 Apr 2018 04:57:02 GMT)

Message #11 received at 31074 <at> debbugs.gnu.org:

From: Geoff Kuenning <geoff <at> cs.hmc.edu>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 31074 <at> debbugs.gnu.org
Subject: Re: bug#31074: Grep -i is slow
Date: Sun, 08 Apr 2018 21:56:25 -0700

Nice catch!  It looks like I'm running grep 2.16.  No clue why my
distro is more than four years behind.  I'll whine at them.  In 
the meantime, let's close this bug; I can reopen it in the 
unlikely event that it's still there when I get around to 
upgrading.

> It sounds like you've run into a bug that was fixed in grep 2.18
> (2014-02-20). Please try grep 3.1, the current version. If that
> doesn't work, it'd be helpful if you could give us a reproducible
> test case. Here's how I tried (and failed) to reproduce the problem
> on Fedora 27 x86-64, which has grep 3.1:
>
> $ shuf -i 1-20000000 >rawindex
> $ ls -l rawindex
> -rw-r--r--. 1 eggert eggert 168888897 Apr  6 12:30 rawindex
> $ time grep outgoing.*harris.*dcraw rawindex
>
> real    0m0.069s
> user    0m0.013s
> sys     0m0.055s
> $ time grep -i outgoing.*harris.*dcraw rawindex
>
> real    0m0.418s
> user    0m0.368s
> sys     0m0.048s
> $ time grep '[Oo][Uu][Tt][Gg][Oo][Ii][Nn][Gg].*[Hh][Aa][Rr][Rr][Ii][Ss].*[Dd][Cc][Rr][Aa][Ww]' rawindex
>
> real    0m0.416s
> user    0m0.357s
> sys     0m0.058s
> $ locale
> LANG=en_US.UTF-8
> LC_CTYPE="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_PAPER="en_US.UTF-8"
> LC_NAME="en_US.UTF-8"
> LC_ADDRESS="en_US.UTF-8"
> LC_TELEPHONE="en_US.UTF-8"
> LC_MEASUREMENT="en_US.UTF-8"
> LC_IDENTIFICATION="en_US.UTF-8"
> LC_ALL=
>

-- 
   Geoff Kuenning   geoff <at> cs.hmc.edu 
   http://www.cs.hmc.edu/~geoff/

I have always wished for my computer to be as easy to use as my
telephone; my wish has come true because I can no longer figure out
how to use my telephone.
		-- Bjarne Stroustrup




bug closed, send any further explanations to 31074 <at> debbugs.gnu.org and Geoff Kuenning <geoff <at> cs.hmc.edu>. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Mon, 09 Apr 2018 18:39:01 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 08 May 2018 11:24:04 GMT)


