GNU bug report logs -
#24009
[PATCH] grep: use fastmap in regex
Previous Next
Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Date: Sat, 16 Jul 2016 16:59:02 UTC
Severity: normal
Tags: patch
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 24009 in the body.
You can then email your comments to 24009 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-grep <at> gnu.org
:
bug#24009
; Package
grep
.
(Sat, 16 Jul 2016 16:59:02 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
Norihiro Tanaka <noritnk <at> kcn.ne.jp>
:
New bug report received and forwarded. Copy sent to
bug-grep <at> gnu.org
.
(Sat, 16 Jul 2016 16:59:02 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
sed and gawk use fastmap in regex, but grep does not. By using fastmap,
I expect that grep speeds up for patterns as regex is used.
before:
$ time -p env LC_ALL=ja_JP.eucjp src/grep '\([a-b]\)\1' k
real 7.83
user 7.62
sys 0.07
after:
$ time -p env LC_ALL=ja_JP.eucjp src/grep '\([a-b]\)\1' k
real 0.46
user 0.38
sys 0.07
However, if grep uses fastmap, fails in case-fold-titlecase test. It
means that grep's behavior differ from sed and gawk, as they use fastmap,
although it seems to be a bug in regex.
[0001-grep-use-fastmap-in-regex.patch (text/plain, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#24009
; Package
grep
.
(Sat, 16 Jul 2016 20:08:01 GMT)
Full text and
rfc822 format available.
Message #8 received at submit <at> debbugs.gnu.org (full text, mbox):
Hi Norihiro.
> sed and gawk use fastmap in regex, but grep does not. By using fastmap,
> I expect that grep speeds up for patterns as regex is used.
>
> before:
> $ time -p env LC_ALL=ja_JP.eucjp src/grep '\([a-b]\)\1' k
> real 7.83
> user 7.62
> sys 0.07
>
> after:
> $ time -p env LC_ALL=ja_JP.eucjp src/grep '\([a-b]\)\1' k
> real 0.46
> user 0.38
> sys 0.07
>
> However, if grep uses fastmap, fails in case-fold-titlecase test. It
> means that grep's behavior differ from sed and gawk, as they use fastmap,
> although it seems to be a bug in regex.
Wow, that is a spectacular speed improvement. Since I use grep with regex
patterns heavily in some of my scripts I could not resist to make some
first simple tests (including your example pattern with a back reference).
The non-representative results using grep 2.25 shows a gain of a factor
5-10 (while the unpatched self-compiled grep 2.25 itself was already a
factor 1.4-2.8 faster than the grep 2.16 offered by the OS (OpenSUSE Leap
42.1). At least in my tests all the grep outputs were identical.
By the way I had to remove one of the two "=" in your patch otherwise gcc
issued an error (but caution, I am a C-layman).
Regards
Jens
Information forwarded
to
bug-grep <at> gnu.org
:
bug#24009
; Package
grep
.
(Sun, 17 Jul 2016 00:56:02 GMT)
Full text and
rfc822 format available.
Message #11 received at 24009 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On Sat, 16 Jul 2016 22:06:53 +0200 (CEST)
"Jens Schleusener" <Jens.Schleusener <at> t-online.de> wrote:
> Wow, that is a spectacular speed improvement. Since I use grep with
> regex patterns heavily in some of my scripts I could not resist to
> make some first simple tests (including your example pattern with a
> back reference). The non-representative results using grep 2.25 shows
> a gain of a factor 5-10 (while the unpatched self-compiled grep 2.25
> itself was already a factor 1.4-2.8 faster than the grep 2.16 offered
> by the OS (OpenSUSE Leap 42.1). At least in my tests all the grep
> outputs were identical.
I believe that cases to speed up by this patch is not so much, as grep
makes a lot of other optimizations. In fact, I spent a little time to
make a test case to demonstrate to speed up by this patch. So I have an
interest with what kind of test cases you could confirm to speed up by
this patch.
> By the way I had to remove one of the two "=" in your patch otherwise
> gcc issued an error (but caution, I am a C-layman).
Thanks, I fixed it. I made a mistake before sending the patch. Of
course, "=" should be one.
[0001-grep-use-fastmap-in-regex.patch (text/plain, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#24009
; Package
grep
.
(Sun, 17 Jul 2016 10:22:02 GMT)
Full text and
rfc822 format available.
Message #14 received at 24009 <at> debbugs.gnu.org (full text, mbox):
On Sun, 17 Jul 2016, Norihiro Tanaka wrote:
> On Sat, 16 Jul 2016 22:06:53 +0200 (CEST)
> "Jens Schleusener" <Jens.Schleusener <at> t-online.de> wrote:
>
>> Wow, that is a spectacular speed improvement. Since I use grep with
>> regex patterns heavily in some of my scripts I could not resist to
>> make some first simple tests (including your example pattern with a
>> back reference). The non-representative results using grep 2.25 shows
>> a gain of a factor 5-10 (while the unpatched self-compiled grep 2.25
>> itself was already a factor 1.4-2.8 faster than the grep 2.16 offered
>> by the OS (OpenSUSE Leap 42.1). At least in my tests all the grep
>> outputs were identical.
>
> I believe that cases to speed up by this patch is not so much, as grep
> makes a lot of other optimizations. In fact, I spent a little time to
> make a test case to demonstrate to speed up by this patch. So I have an
> interest with what kind of test cases you could confirm to speed up by
> this patch.
First, as I mentioned in my mail, my "tests" are non-representative
and done on a server system that runs also other jobs just to get a first
impression.
Currently I have redone some of the tests here are the more detailed
results (hopefully readable in this mail).
OS: OpenSUSE Leap 42.1 (64-bit)
gcc: version 4.8.5 (SUSE Linux)
Main testfile was an Apache access log file (in nearly combined log
format) with a size of 157 MB and 673623 lines that looks like:
66.249.78.85 - - [16/Jul/2016:00:00:02 +0200] "GET /dox/phpMyAdmin-4.6.1-all-languages/namespacePMA_1_1libraries_1_1properties_1_1options_1_1items_1_1TextPropertyItem.html HTTP/1.1" 410 1977 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0 - -
The test command was
time -p env LC_ALL=en_US.UTF-8 grep pattern logfile > /dev/shm/out
The tested grep versions:
1: GNU grep 2.16 (OpenSuSE Leap 42.1)
2: GNU grep 2.25 (self-compiled)
3: GNU grep 2.25 (self-compiled with 0001-grep-use-fastmap-in-regex.patch)
"self-compiled" means just "./configure; make; make install".
Times are in seconds (rounded; so the sum user+sys sometimes is different
from the real value) and at least 2 measurements were done. Naturally the
output time and the current load may have had an influence (but probably
not a drastic one).
pattern vrs real user sys
------------ --- ---- ---- ---
\([a-b]\)\1
1: 24.3 23.6 0.6
2: 18.2 18.2 0.0
3: 1.9 1.9 0.0
\([a-b]\)
1: 9.4 8.8 0.6
2: 3.4 3.4 0.0
3: 0.7 0.6 0.1
[a-b]
1: 8.8 8.1 0.7
2: 3.2 3.1 0.0
3: 0.4 0.4 0.1
"GET /dox/.*-[0-9\.]*.*/.*\.html.* HTTP/1.1" 410
1: 7.62 7.60 0.02
2: 0.33 0.32 0.01
3: 0.29 0.28 0.01
No idea if that values are meaningful but as a layman I have the
impression grep version 3 is faster than 2 and 2 is faster than 1 ;-)
>> By the way I had to remove one of the two "=" in your patch otherwise
>> gcc issued an error (but caution, I am a C-layman).
>
> Thanks, I fixed it. I made a mistake before sending the patch. Of
> course, "=" should be one.
No problem, I used exact your new patch version.
Regards
Jens
P.S.: OT and you are probably the wrong address: I would like to see
some "agrep" functionality in GNU grep.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#24009
; Package
grep
.
(Sun, 17 Jul 2016 12:29:02 GMT)
Full text and
rfc822 format available.
Message #17 received at 24009 <at> debbugs.gnu.org (full text, mbox):
On Sun, 17 Jul 2016 12:21:46 +0200 (CEST)
Jens Schleusener <Jens.Schleusener <at> t-online.de> wrote:
> First, as I mentioned in my mail, my "tests" are non-representative
> and done on a server system that runs also other jobs just to get a first impression.
>
> Currently I have redone some of the tests here are the more detailed results (hopefully readable in this mail).
>
> OS: OpenSUSE Leap 42.1 (64-bit)
> gcc: version 4.8.5 (SUSE Linux)
>
> Main testfile was an Apache access log file (in nearly combined log format) with a size of 157 MB and 673623 lines that looks like:
>
> 66.249.78.85 - - [16/Jul/2016:00:00:02 +0200] "GET /dox/phpMyAdmin-4.6.1-all-languages/namespacePMA_1_1libraries_1_1properties_1_1options_1_1items_1_1TextPropertyItem.html HTTP/1.1" 410 1977 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0 - -
>
> The test command was
>
> time -p env LC_ALL=en_US.UTF-8 grep pattern logfile > /dev/shm/out
>
> The tested grep versions:
>
> 1: GNU grep 2.16 (OpenSuSE Leap 42.1)
> 2: GNU grep 2.25 (self-compiled)
> 3: GNU grep 2.25 (self-compiled with 0001-grep-use-fastmap-in-regex.patch)
>
> "self-compiled" means just "./configure; make; make install".
>
> Times are in seconds (rounded; so the sum user+sys sometimes is different from the real value) and at least 2 measurements were done. Naturally the output time and the current load may have had an influence (but probably not a drastic one).
>
> pattern vrs real user sys
> ------------ --- ---- ---- ---
> \([a-b]\)\1
> 1: 24.3 23.6 0.6
> 2: 18.2 18.2 0.0
> 3: 1.9 1.9 0.0
>
> \([a-b]\)
> 1: 9.4 8.8 0.6
> 2: 3.4 3.4 0.0
> 3: 0.7 0.6 0.1
>
> [a-b]
> 1: 8.8 8.1 0.7
> 2: 3.2 3.1 0.0
> 3: 0.4 0.4 0.1
>
> "GET /dox/.*-[0-9\.]*.*/.*\.html.* HTTP/1.1" 410
> 1: 7.62 7.60 0.02
> 2: 0.33 0.32 0.01
> 3: 0.29 0.28 0.01
>
> No idea if that values are meaningful but as a layman I have the impression grep version 3 is faster than 2 and 2 is faster than 1 ;-)
>
> >> By the way I had to remove one of the two "=" in your patch otherwise >> gcc issued an error (but caution, I am a C-layman).
> >
> > Thanks, I fixed it. I made a mistake before sending the patch. Of
> > course, "=" should be one.
>
> No problem, I used exact your new patch version.
>
> Regards
>
> Jens
>
> P.S.: OT and you are probably the wrong address: I would like to see some "agrep" functionality in GNU grep.
Thanks. I expect that this patch improves pattern starting with range,
character class, collating element and equivalent class in a multibyte
locale as three of until third case. forth case seems to be mainly
improved by other optimization.
> However, if grep uses fastmap, fails in case-fold-titlecase test. It
> means that grep's behavior differ from sed and gawk, as they use fastmap,
> although it seems to be a bug in regex.
However, I think this patch should be suspended because of this issue.
I reported it to glibc developers. https://sourceware.org/bugzilla/show_bug.cgi?id=20381
Reply sent
to
Paul Eggert <eggert <at> cs.ucla.edu>
:
You have taken responsibility.
(Fri, 02 Sep 2016 05:33:01 GMT)
Full text and
rfc822 format available.
Notification sent
to
Norihiro Tanaka <noritnk <at> kcn.ne.jp>
:
bug acknowledged by developer.
(Fri, 02 Sep 2016 05:33:02 GMT)
Full text and
rfc822 format available.
Message #22 received at 24009-done <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Norihiro Tanaka wrote:
> I think this patch should be suspended because of this issue.
> I reported it to glibc developers. https://sourceware.org/bugzilla/show_bug.cgi?id=20381
After thinking about it a bit, I came up with a variant of the patch that gives
the performance improvement unless -i is used, so I installed the attached
patches. The first patch is mostly just refactoring this somewhat-crufty code
and fixing an O(N**2) reallocation problem. The second is the real improvement.
The second patch just captures the low-hanging fruit. For example, even with -i
we could use a fastmap if all the pattern's letters (including letters matched
by ranges) happen to avoid the glibc bug. Something like that might be worth
pursuing.
Since the attached patch fixes the test case that prompted the bug report I'm
closing the bug. We can reopen it, or open a new one, if someone wants to fix
the remaining performance glitches.
Thanks again for all these fixes!
[0001-grep-improve-dfasearch-storage-management.patch (text/x-diff, attachment)]
[0002-grep-use-regex-fastmap-unless-i.patch (text/x-diff, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#24009
; Package
grep
.
(Fri, 02 Sep 2016 11:51:01 GMT)
Full text and
rfc822 format available.
Message #25 received at 24009-done <at> debbugs.gnu.org (full text, mbox):
On Thu, 1 Sep 2016 22:32:12 -0700
Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Norihiro Tanaka wrote:
> > I think this patch should be suspended because of this issue.
> > I reported it to glibc developers. https://sourceware.org/bugzilla/show_bug.cgi?id=20381
>
> After thinking about it a bit, I came up with a variant of the patch that gives the performance improvement unless -i is used, so I installed the attached patches. The first patch is mostly just refactoring this somewhat-crufty code and fixing an O(N**2) reallocation problem. The second is the real improvement.
>
> The second patch just captures the low-hanging fruit. For example, even with -i we could use a fastmap if all the pattern's letters (including letters matched by ranges) happen to avoid the glibc bug. Something like that might be worth pursuing.
>
> Since the attached patch fixes the test case that prompted the bug report I'm closing the bug. We can reopen it, or open a new one, if someone wants to fix the remaining performance glitches.
>
> Thanks again for all these fixes!
Nice! I could not find the workaround. I will occasionally check the
bug in glibc, and if it is fixed, I will write a patch again.
Thanks.
bug archived.
Request was from
Debbugs Internal Request <help-debbugs <at> gnu.org>
to
internal_control <at> debbugs.gnu.org
.
(Sat, 01 Oct 2016 11:24:03 GMT)
Full text and
rfc822 format available.
This bug report was last modified 8 years and 321 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.