GNU bug report logs - #31074
Grep -i is slow
Reported by: Geoff Kuenning <geoff <at> cs.hmc.edu>
Date: Fri, 6 Apr 2018 05:33:02 UTC
Severity: normal
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 31074 in the body.
You can then email your comments to 31074 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-grep <at> gnu.org: bug#31074; Package grep. (Fri, 06 Apr 2018 05:33:02 GMT)
Acknowledgement sent to Geoff Kuenning <geoff <at> cs.hmc.edu>: New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Fri, 06 Apr 2018 05:33:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org:
The -i switch is slow when searching large files. I haven't dug into
the code in detail, although it seems that dfa.c is trying to build an
intelligent case-agnostic DFA when -i is specified. But that doesn't
seem to be working. Perhaps that's because I'm running the UTF-8
character set? Although I don't see why that would affect the DFA.
Here's an example of timing several greps of a 151M file named "rawindex",
which has already been read so that it is in the file system buffer cache.
In each case the grep finds a single match, since the matched line is
actually all lowercase; for privacy, I have omitted the match lines
themselves.
A straightforward match takes only 199 ms even with two .* patterns.
Adding -i blows that up to 6917 ms. Finally when I write an explicit
case-agnostic pattern to force how the DFA is built, it does run slower
(532 ms) but it's nowhere near the -i time.
mallet:514> time grep outgoing.*harris.*dcraw rawindex
real 0m0.199s
user 0m0.170s
sys 0m0.029s
mallet:515> time grep -i outgoing.*harris.*dcraw rawindex
real 0m6.917s
user 0m6.879s
sys 0m0.036s
mallet:516> time grep '[Oo][Uu][Tt][Gg][Oo][Ii][Nn][Gg].*[Hh][Aa][Rr][Rr][Ii][Ss].*[Dd][Cc][Rr][Aa][Ww]' rawindex
real 0m0.532s
user 0m0.491s
sys 0m0.040s
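The explicit case-agnostic pattern in the last command can be generated mechanically rather than typed by hand. The following is a minimal sketch, not anything grep itself provides: the function name casefold_pattern is hypothetical, and it assumes a simple pattern with ASCII letters and no existing bracket expressions or backslash escapes (it would mangle those).

```shell
# Hypothetical helper: rewrite each ASCII letter in a simple pattern as a
# two-case bracket expression, e.g. "a" becomes "[Aa]".
# Assumption: the pattern contains no bracket expressions or backslash escapes.
casefold_pattern() {
    printf '%s\n' "$1" | awk '{
        out = ""
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            if (c ~ /[A-Za-z]/)
                out = out "[" toupper(c) tolower(c) "]"   # letter: both cases
            else
                out = out c                               # anything else: unchanged
        }
        print out
    }'
}
```

It could then be used as: grep "$(casefold_pattern 'outgoing.*harris.*dcraw')" rawindex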
--
Geoff Kuenning geoff <at> cs.hmc.edu http://www.cs.hmc.edu/~geoff/
The DMCA criminalizes curiosity. It would put Susie in jail for
taking her stereo apart to see how it works.
Information forwarded to bug-grep <at> gnu.org: bug#31074; Package grep. (Fri, 06 Apr 2018 19:36:01 GMT)
Message #8 received at 31074 <at> debbugs.gnu.org:
It sounds like you've run into a bug that was fixed in grep 2.18
(2014-02-20). Please try grep 3.1, the current version. If that doesn't
work, it'd be helpful if you could give us a reproducible test case.
Here's how I tried (and failed) to reproduce the problem on Fedora 27
x86-64, which has grep 3.1:
$ shuf -i 1-20000000 >rawindex
$ ls -l rawindex
-rw-r--r--. 1 eggert eggert 168888897 Apr 6 12:30 rawindex
$ time grep outgoing.*harris.*dcraw rawindex
real 0m0.069s
user 0m0.013s
sys 0m0.055s
$ time grep -i outgoing.*harris.*dcraw rawindex
real 0m0.418s
user 0m0.368s
sys 0m0.048s
$ time grep '[Oo][Uu][Tt][Gg][Oo][Ii][Nn][Gg].*[Hh][Aa][Rr][Rr][Ii][Ss].*[Dd][Cc][Rr][Aa][Ww]' rawindex
real 0m0.416s
user 0m0.357s
sys 0m0.058s
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
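Since the original report wonders whether the UTF-8 character set is a factor, one way to check is to run the same -i search under the C locale and compare timings. This is only a sketch: the helper name grep_i_in_locale is hypothetical, and which locales other than C are installed depends on the system.

```shell
# Hypothetical helper: run a case-insensitive grep under a chosen locale
# and print the number of matching lines.
#   $1 = locale name (e.g. C, or en_US.UTF-8 if installed)
#   $2 = pattern
#   $3 = file to search
grep_i_in_locale() {
    loc=$1
    pattern=$2
    file=$3
    # LC_ALL overrides LANG and all other LC_* settings for this one command.
    LC_ALL=$loc grep -i -c -- "$pattern" "$file"
}
```

In bash, where time is a shell keyword and can time a function, "time grep_i_in_locale C 'outgoing.*harris.*dcraw' rawindex" can be compared against the same call with en_US.UTF-8. A large gap between the two would point at multibyte handling rather than at -i as such.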
Information forwarded to bug-grep <at> gnu.org: bug#31074; Package grep. (Mon, 09 Apr 2018 04:57:02 GMT)
Message #11 received at 31074 <at> debbugs.gnu.org:
Nice catch! It looks like I'm running grep 2.16. No clue why my
distro is more than four years behind. I'll whine at them. In
the meantime, let's close this bug; I can reopen it in the
unlikely event that it's still there when I get around to
upgrading.
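For anyone hitting the same slowdown, a quick version check against the 2.18 fix mentioned in this thread might look like the sketch below. The helper name version_ge is hypothetical; it assumes sort -V (version sort, as in GNU coreutils) is available and that the first line of grep --version ends with the version number, as it does for GNU grep.

```shell
# Hypothetical helper: succeed if version $1 is at least version $2.
# Relies on sort -V, which orders version strings numerically by component.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumption: the first line of the version output ends with the version number.
installed=$(grep --version | head -n 1 | awk '{print $NF}')

if version_ge "$installed" 2.18; then
    echo "grep $installed is at least 2.18"
else
    echo "grep $installed predates 2.18; the slow -i behavior may apply"
fi
```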
--
Geoff Kuenning geoff <at> cs.hmc.edu
http://www.cs.hmc.edu/~geoff/
I have always wished for my computer to be as easy to use as my
telephone; my wish has come true because I can no longer figure out
how to use my telephone.
-- Bjarne Stroustrup
bug closed, send any further explanations to 31074 <at> debbugs.gnu.org and Geoff Kuenning <geoff <at> cs.hmc.edu>.
Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Mon, 09 Apr 2018 18:39:01 GMT)
bug archived.
Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 08 May 2018 11:24:04 GMT)
This bug report was last modified 7 years and 46 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.