GNU bug report logs - #24009
[PATCH] grep: use fastmap in regex

Previous Next

Package: grep;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Sat, 16 Jul 2016 16:59:02 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 24009 in the body.
You can then email your comments to 24009 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-grep <at> gnu.org:
bug#24009; Package grep. (Sat, 16 Jul 2016 16:59:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sat, 16 Jul 2016 16:59:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: <bug-grep <at> gnu.org>
Subject: [PATCH] grep: use fastmap in regex
Date: Sun, 17 Jul 2016 01:57:30 +0900
[Message part 1 (text/plain, inline)]
sed and gawk use fastmap in regex, but grep does not.  By using fastmap,
I expect that grep speeds up for patterns as regex is used.

before:
$ time -p env LC_ALL=ja_JP.eucjp src/grep '\([a-b]\)\1' k
real 7.83
user 7.62
sys 0.07

after:
$ time -p env LC_ALL=ja_JP.eucjp src/grep '\([a-b]\)\1' k
real 0.46
user 0.38
sys 0.07

However, if grep uses fastmap, fails in case-fold-titlecase test.  It
means that grep's behavior differ from sed and gawk, as they use fastmap,
although it seems to be a bug in regex.
[0001-grep-use-fastmap-in-regex.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#24009; Package grep. (Sat, 16 Jul 2016 20:08:01 GMT) Full text and rfc822 format available.

Message #8 received at submit <at> debbugs.gnu.org (full text, mbox):

From: "Jens Schleusener" <Jens.Schleusener <at> t-online.de>
To: bug-grep <at> gnu.org
Subject: Re: bug#24009: [PATCH] grep: use fastmap in regex
Date: Sat, 16 Jul 2016 22:06:53 +0200 (CEST)
Hi Norihiro.

> sed and gawk use fastmap in regex, but grep does not.  By using fastmap,
> I expect that grep speeds up for patterns as regex is used.
>
> before:
> $ time -p env LC_ALL=ja_JP.eucjp src/grep '\([a-b]\)\1' k
> real 7.83
> user 7.62
> sys 0.07
>
> after:
> $ time -p env LC_ALL=ja_JP.eucjp src/grep '\([a-b]\)\1' k
> real 0.46
> user 0.38
> sys 0.07
>
> However, if grep uses fastmap, fails in case-fold-titlecase test.  It
> means that grep's behavior differ from sed and gawk, as they use fastmap,
> although it seems to be a bug in regex.

Wow, that is a spectacular speed improvement. Since I use grep with regex 
patterns heavily in some of my scripts I could not resist to make some 
first simple tests (including your example pattern with a back reference). 
The non-representative results using grep 2.25 shows a gain of a factor 
5-10 (while the unpatched self-compiled grep 2.25 itself was already a 
factor 1.4-2.8 faster than the grep 2.16 offered by the OS (OpenSUSE Leap 
42.1). At least in my tests all the grep outputs  were identical.

By the way I had to remove one of the two "=" in your patch otherwise gcc 
issued an error (but caution, I am a C-layman).

Regards

Jens




Information forwarded to bug-grep <at> gnu.org:
bug#24009; Package grep. (Sun, 17 Jul 2016 00:56:02 GMT) Full text and rfc822 format available.

Message #11 received at 24009 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: "Jens Schleusener" <Jens.Schleusener <at> t-online.de>
Cc: 24009 <at> debbugs.gnu.org
Subject: Re: bug#24009: [PATCH] grep: use fastmap in regex
Date: Sun, 17 Jul 2016 09:55:00 +0900
[Message part 1 (text/plain, inline)]
On Sat, 16 Jul 2016 22:06:53 +0200 (CEST)
"Jens Schleusener" <Jens.Schleusener <at> t-online.de> wrote:

> Wow, that is a spectacular speed improvement. Since I use grep with
> regex patterns heavily in some of my scripts I could not resist to
> make some first simple tests (including your example pattern with a
> back reference). The non-representative results using grep 2.25 shows
> a gain of a factor 5-10 (while the unpatched self-compiled grep 2.25
> itself was already a factor 1.4-2.8 faster than the grep 2.16 offered
> by the OS (OpenSUSE Leap 42.1). At least in my tests all the grep
> outputs  were identical.

I believe that cases to speed up by this patch is not so much, as grep
makes a lot of other optimizations.  In fact, I spent a little time to
make a test case to demonstrate to speed up by this patch.  So I have an
interest with what kind of test cases you could confirm to speed up by
this patch.

> By the way I had to remove one of the two "=" in your patch otherwise
> gcc issued an error (but caution, I am a C-layman).

Thanks, I fixed it.  I made a mistake before sending the patch.  Of
course, "=" should be one.
[0001-grep-use-fastmap-in-regex.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#24009; Package grep. (Sun, 17 Jul 2016 10:22:02 GMT) Full text and rfc822 format available.

Message #14 received at 24009 <at> debbugs.gnu.org (full text, mbox):

From: Jens Schleusener <Jens.Schleusener <at> t-online.de>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 24009 <at> debbugs.gnu.org
Subject: Re: bug#24009: [PATCH] grep: use fastmap in regex
Date: Sun, 17 Jul 2016 12:21:46 +0200 (CEST)
On Sun, 17 Jul 2016, Norihiro Tanaka wrote:

> On Sat, 16 Jul 2016 22:06:53 +0200 (CEST)
> "Jens Schleusener" <Jens.Schleusener <at> t-online.de> wrote:
>
>> Wow, that is a spectacular speed improvement. Since I use grep with
>> regex patterns heavily in some of my scripts I could not resist to
>> make some first simple tests (including your example pattern with a
>> back reference). The non-representative results using grep 2.25 shows
>> a gain of a factor 5-10 (while the unpatched self-compiled grep 2.25
>> itself was already a factor 1.4-2.8 faster than the grep 2.16 offered
>> by the OS (OpenSUSE Leap 42.1). At least in my tests all the grep
>> outputs  were identical.
>
> I believe that cases to speed up by this patch is not so much, as grep
> makes a lot of other optimizations.  In fact, I spent a little time to
> make a test case to demonstrate to speed up by this patch.  So I have an
> interest with what kind of test cases you could confirm to speed up by
> this patch.

First, as I mentioned in my mail, my "tests" are non-representative
and done on a server system that runs also other jobs just to get a first 
impression.

Currently I have redone some of the tests here are the more detailed 
results (hopefully readable in this mail).

 OS: OpenSUSE Leap 42.1 (64-bit)
 gcc: version 4.8.5 (SUSE Linux)

Main testfile was an Apache access log file (in nearly combined log 
format) with a size of 157 MB and 673623 lines that looks like:

 66.249.78.85 - - [16/Jul/2016:00:00:02 +0200] "GET /dox/phpMyAdmin-4.6.1-all-languages/namespacePMA_1_1libraries_1_1properties_1_1options_1_1items_1_1TextPropertyItem.html HTTP/1.1" 410 1977 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0 - -

The test command was

 time -p env LC_ALL=en_US.UTF-8 grep pattern logfile > /dev/shm/out

The tested grep versions:

 1: GNU grep 2.16 (OpenSuSE Leap 42.1)
 2: GNU grep 2.25 (self-compiled)
 3: GNU grep 2.25 (self-compiled with 0001-grep-use-fastmap-in-regex.patch)

"self-compiled" means just "./configure; make; make install".

Times are in seconds (rounded; so the sum user+sys sometimes is different 
from the real value) and at least 2 measurements were done. Naturally the 
output time and the current load may have had an influence (but probably 
not a drastic one).

pattern      vrs real  user  sys
------------ --- ----  ----  ---
\([a-b]\)\1
              1: 24.3  23.6  0.6
              2: 18.2  18.2  0.0
              3:  1.9   1.9  0.0

\([a-b]\)
              1:  9.4   8.8  0.6
              2:  3.4   3.4  0.0
              3:  0.7   0.6  0.1

[a-b]
              1:  8.8   8.1  0.7
              2:  3.2   3.1  0.0
              3:  0.4   0.4  0.1

"GET /dox/.*-[0-9\.]*.*/.*\.html.* HTTP/1.1" 410
              1:  7.62  7.60 0.02
              2:  0.33  0.32 0.01
              3:  0.29  0.28 0.01

No idea if that values are meaningful but as a layman I have the 
impression grep version 3 is faster than 2 and 2 is faster than 1 ;-)

>> By the way I had to remove one of the two "=" in your patch otherwise 
>> gcc issued an error (but caution, I am a C-layman).
>
> Thanks, I fixed it.  I made a mistake before sending the patch.  Of
> course, "=" should be one.

No problem, I used exact your new patch version.

Regards

Jens

P.S.: OT and you are probably the wrong address: I would like to see 
some "agrep" functionality in GNU grep.




Information forwarded to bug-grep <at> gnu.org:
bug#24009; Package grep. (Sun, 17 Jul 2016 12:29:02 GMT) Full text and rfc822 format available.

Message #17 received at 24009 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Jens Schleusener <Jens.Schleusener <at> t-online.de>
Cc: 24009 <at> debbugs.gnu.org
Subject: Re: bug#24009: [PATCH] grep: use fastmap in regex
Date: Sun, 17 Jul 2016 21:27:55 +0900
On Sun, 17 Jul 2016 12:21:46 +0200 (CEST)
Jens Schleusener <Jens.Schleusener <at> t-online.de> wrote:

> First, as I mentioned in my mail, my "tests" are non-representative
> and done on a server system that runs also other jobs just to get a first impression.
> 
> Currently I have redone some of the tests here are the more detailed results (hopefully readable in this mail).
> 
>   OS: OpenSUSE Leap 42.1 (64-bit)
>   gcc: version 4.8.5 (SUSE Linux)
> 
> Main testfile was an Apache access log file (in nearly combined log format) with a size of 157 MB and 673623 lines that looks like:
> 
>   66.249.78.85 - - [16/Jul/2016:00:00:02 +0200] "GET /dox/phpMyAdmin-4.6.1-all-languages/namespacePMA_1_1libraries_1_1properties_1_1options_1_1items_1_1TextPropertyItem.html HTTP/1.1" 410 1977 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0 - -
> 
> The test command was
> 
>   time -p env LC_ALL=en_US.UTF-8 grep pattern logfile > /dev/shm/out
> 
> The tested grep versions:
> 
>   1: GNU grep 2.16 (OpenSuSE Leap 42.1)
>   2: GNU grep 2.25 (self-compiled)
>   3: GNU grep 2.25 (self-compiled with 0001-grep-use-fastmap-in-regex.patch)
> 
> "self-compiled" means just "./configure; make; make install".
> 
> Times are in seconds (rounded; so the sum user+sys sometimes is different from the real value) and at least 2 measurements were done. Naturally the output time and the current load may have had an influence (but probably not a drastic one).
> 
> pattern      vrs real  user  sys
> ------------ --- ----  ----  ---
> \([a-b]\)\1
>                1: 24.3  23.6  0.6
>                2: 18.2  18.2  0.0
>                3:  1.9   1.9  0.0
> 
> \([a-b]\)
>                1:  9.4   8.8  0.6
>                2:  3.4   3.4  0.0
>                3:  0.7   0.6  0.1
> 
> [a-b]
>                1:  8.8   8.1  0.7
>                2:  3.2   3.1  0.0
>                3:  0.4   0.4  0.1
> 
> "GET /dox/.*-[0-9\.]*.*/.*\.html.* HTTP/1.1" 410
>                1:  7.62  7.60 0.02
>                2:  0.33  0.32 0.01
>                3:  0.29  0.28 0.01
> 
> No idea if that values are meaningful but as a layman I have the impression grep version 3 is faster than 2 and 2 is faster than 1 ;-)
> 
> >> By the way I had to remove one of the two "=" in your patch otherwise >> gcc issued an error (but caution, I am a C-layman).
> >
> > Thanks, I fixed it.  I made a mistake before sending the patch.  Of
> > course, "=" should be one.
> 
> No problem, I used exact your new patch version.
> 
> Regards
> 
> Jens
> 
> P.S.: OT and you are probably the wrong address: I would like to see some "agrep" functionality in GNU grep.

Thanks.  I expect that this patch improves pattern starting with range,
character class, collating element and equivalent class in a multibyte
locale as three of until third case.  forth case seems to be mainly
improved by other optimization.

> However, if grep uses fastmap, fails in case-fold-titlecase test.  It
> means that grep's behavior differ from sed and gawk, as they use fastmap,
> although it seems to be a bug in regex.

However, I think this patch should be suspended because of this issue.
I reported it to glibc developers.  https://sourceware.org/bugzilla/show_bug.cgi?id=20381






Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Fri, 02 Sep 2016 05:33:01 GMT) Full text and rfc822 format available.

Notification sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
bug acknowledged by developer. (Fri, 02 Sep 2016 05:33:02 GMT) Full text and rfc822 format available.

Message #22 received at 24009-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 24009-done <at> debbugs.gnu.org, Jens Schleusener <Jens.Schleusener <at> t-online.de>
Subject: Re: bug#24009: [PATCH] grep: use fastmap in regex
Date: Thu, 1 Sep 2016 22:32:12 -0700
[Message part 1 (text/plain, inline)]
Norihiro Tanaka wrote:
> I think this patch should be suspended because of this issue.
> I reported it to glibc developers.  https://sourceware.org/bugzilla/show_bug.cgi?id=20381

After thinking about it a bit, I came up with a variant of the patch that gives 
the performance improvement unless -i is used, so I installed the attached 
patches. The first patch is mostly just refactoring this somewhat-crufty code 
and fixing an O(N**2) reallocation problem. The second is the real improvement.

The second patch just captures the low-hanging fruit. For example, even with -i 
we could use a fastmap if all the pattern's letters (including letters matched 
by ranges) happen to avoid the glibc bug. Something like that might be worth 
pursuing.

Since the attached patch fixes the test case that prompted the bug report I'm 
closing the bug. We can reopen it, or open a new one, if someone wants to fix 
the remaining performance glitches.

Thanks again for all these fixes!
[0001-grep-improve-dfasearch-storage-management.patch (text/x-diff, attachment)]
[0002-grep-use-regex-fastmap-unless-i.patch (text/x-diff, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#24009; Package grep. (Fri, 02 Sep 2016 11:51:01 GMT) Full text and rfc822 format available.

Message #25 received at 24009-done <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 24009-done <at> debbugs.gnu.org, Jens Schleusener <Jens.Schleusener <at> t-online.de>
Subject: Re: bug#24009: [PATCH] grep: use fastmap in regex
Date: Fri, 02 Sep 2016 20:50:27 +0900
On Thu, 1 Sep 2016 22:32:12 -0700
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> Norihiro Tanaka wrote:
> > I think this patch should be suspended because of this issue.
> > I reported it to glibc developers.  https://sourceware.org/bugzilla/show_bug.cgi?id=20381
> 
> After thinking about it a bit, I came up with a variant of the patch that gives the performance improvement unless -i is used, so I installed the attached patches. The first patch is mostly just refactoring this somewhat-crufty code and fixing an O(N**2) reallocation problem. The second is the real improvement.
> 
> The second patch just captures the low-hanging fruit. For example, even with -i we could use a fastmap if all the pattern's letters (including letters matched by ranges) happen to avoid the glibc bug. Something like that might be worth pursuing.
> 
> Since the attached patch fixes the test case that prompted the bug report I'm closing the bug. We can reopen it, or open a new one, if someone wants to fix the remaining performance glitches.
> 
> Thanks again for all these fixes!

Nice!  I could not find the workaround.  I will occasionally check the
bug in glibc, and if it is fixed, I will write a patch again.

Thanks.





bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 01 Oct 2016 11:24:03 GMT) Full text and rfc822 format available.

This bug report was last modified 8 years and 321 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.