GNU bug report logs - #24161
[PATCH 2/2] sed: speed up matching by reguler expression with dfa matcher

Previous Next

Package: sed;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Fri, 5 Aug 2016 14:05:01 UTC

Severity: normal

Tags: patch

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 24161 in the body.
You can then email your comments to 24161 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-sed <at> gnu.org:
bug#24161; Package sed. (Fri, 05 Aug 2016 14:05:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
New bug report received and forwarded. Copy sent to bug-sed <at> gnu.org. (Fri, 05 Aug 2016 14:05:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: <bug-sed <at> gnu.org>
Subject: [PATCH 2/2] sed: speed up matching by reguler expression with dfa
 matcher
Date: Fri, 05 Aug 2016 23:03:26 +0900
[Message part 1 (text/plain, inline)]
Hi,

We can speeds up sed by using dfa matcher brought from grep.  gawk users
it, sed does not uses it yet.  It will speed up matching for typical
cases.

$ yes $(printf %040d 0) | head -1000000 >k

Before:

]$ time -p env LC_ALL=C sed/sed -ne /000000000k/p k
real 3.04
user 2.99
sys 0.03
$ time -p env LC_ALL=en_US.utf8 sed/sed -ne /000000000k/p k
real 3.04
user 2.90
sys 0.06
$ time -p env LC_ALL=ja_JP.eucjp sed/sed -ne /000000000k/p k
real 7.09
user 6.77
sys 0.31

After patching:

$ time -p env LC_ALL=C sed/sed -ne /000000000k/p k
real 0.29
user 0.15
sys 0.10
$ time -p env LC_ALL=en_US.utf8 sed/sed -ne /000000000k/p k
real 0.27
user 0.25
sys 0.02
$ time -p env LC_ALL=ja_JP.eucjp sed/sed -ne /000000000k/p k
real 0.33
user 0.29
sys 0.03

I believe that this patch can greatly improve performance of matching by
sed, however I worry about the maintenance as updates for dfa is always
done in grep.

Thanks,
Norihiro
[0002-sed-speed-up-matching-by-reguler-expression-with-dfa.patch (text/plain, attachment)]

Information forwarded to bug-sed <at> gnu.org:
bug#24161; Package sed. (Fri, 05 Aug 2016 14:52:02 GMT) Full text and rfc822 format available.

Message #8 received at 24161 <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, 24161 <at> debbugs.gnu.org
Subject: Re: bug#24161: [PATCH 2/2] sed: speed up matching by reguler
 expression with dfa matcher
Date: Fri, 5 Aug 2016 10:51:40 -0400
Hello Norihiro,

On 08/05/2016 10:03 AM, Norihiro Tanaka wrote:
> We can speeds up sed by using dfa matcher brought from grep.  gawk users
> it, sed does not uses it yet.  It will speed up matching for typical
> cases.
[...]
> I believe that this patch can greatly improve performance of matching by
> sed, however I worry about the maintenance as updates for dfa is always
> done in grep.

Nice improvement, thank you for the patch!

Keeping up with grep should not be a problem - Jim Meyering is the maintainer of both sed and grep (and I also try to keep up-to-date with grep's changes).

I wonder, if this code is used by multiple gnu projects, wouldn't it be better to include it in gnulib (instead of duplicating the code) ?

I also see from a cursory look that the test coverage is low (which is expected: sed currently use the native regex code and does not test specifically for it). I'm happy to try and add more tests, but it'll take a bit of time.

regards,
 - assaf






Information forwarded to bug-sed <at> gnu.org:
bug#24161; Package sed. (Fri, 05 Aug 2016 15:13:02 GMT) Full text and rfc822 format available.

Message #11 received at 24161 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, 24161 <at> debbugs.gnu.org
Subject: Re: bug#24161: [PATCH 2/2] sed: speed up matching by reguler
 expression with dfa matcher
Date: Fri, 5 Aug 2016 08:11:35 -0700
On Fri, Aug 5, 2016 at 7:51 AM, Assaf Gordon <assafgordon <at> gmail.com> wrote:
> Hello Norihiro,
>
> On 08/05/2016 10:03 AM, Norihiro Tanaka wrote:
>>
>> We can speeds up sed by using dfa matcher brought from grep.  gawk users
>> it, sed does not uses it yet.  It will speed up matching for typical
>> cases.
>
> [...]
>>
>> I believe that this patch can greatly improve performance of matching by
>> sed, however I worry about the maintenance as updates for dfa is always
>> done in grep.
>
> Nice improvement, thank you for the patch!

Nice one, indeed.
I hope to review it by this weekend.

> I wonder, if this code is used by multiple gnu projects, wouldn't it be
> better to include it in gnulib (instead of duplicating the code) ?

Yes. However, I'm happy to move dfa.[ch] into gnulib after this patch lands.

> I also see from a cursory look that the test coverage is low (which is
> expected: sed currently use the native regex code and does not test
> specifically for it). I'm happy to try and add more tests, but it'll take a
> bit of time.

Eventually, we'll want dfa's tests to reside in gnulib.




Information forwarded to bug-sed <at> gnu.org:
bug#24161; Package sed. (Sun, 07 Aug 2016 00:51:01 GMT) Full text and rfc822 format available.

Message #14 received at 24161 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: 24161 <at> debbugs.gnu.org
Subject: Re: bug#24161: [PATCH 2/2] sed: speed up matching by reguler
 expression with dfa matcher
Date: Sun, 07 Aug 2016 09:50:35 +0900
On Fri, 5 Aug 2016 10:51:40 -0400
Assaf Gordon <assafgordon <at> gmail.com> wrote:

> Hello Norihiro,
> 
> On 08/05/2016 10:03 AM, Norihiro Tanaka wrote:
> > We can speeds up sed by using dfa matcher brought from grep.  gawk users
> > it, sed does not uses it yet.  It will speed up matching for typical
> > cases.
> [...]
> > I believe that this patch can greatly improve performance of matching by
> > sed, however I worry about the maintenance as updates for dfa is always
> > done in grep.
> 
> Nice improvement, thank you for the patch!
> 
> Keeping up with grep should not be a problem - Jim Meyering is the maintainer of both sed and grep (and I also try to keep up-to-date with grep's changes).
> 
> I wonder, if this code is used by multiple gnu projects, wouldn't it be better to include it in gnulib (instead of duplicating the code) ?
> 
> I also see from a cursory look that the test coverage is low (which is expected: sed currently use the native regex code and does not test specifically for it). I'm happy to try and add more tests, but it'll take a bit of time.
> 
> regards,
>   - assaf

Hi Assaf and Jim,

Thanks for considering the patch.  I also think that it is better to
include in gnulib.

Thanks,
Norihiro





Information forwarded to bug-sed <at> gnu.org:
bug#24161; Package sed. (Sun, 07 Aug 2016 02:21:01 GMT) Full text and rfc822 format available.

Message #17 received at 24161 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: Assaf Gordon <assafgordon <at> gmail.com>, 24161 <at> debbugs.gnu.org
Subject: Re: bug#24161: [PATCH 2/2] sed: speed up matching by reguler
 expression with dfa matcher
Date: Sat, 6 Aug 2016 19:20:22 -0700
[Message part 1 (text/plain, inline)]
On Sat, Aug 6, 2016 at 5:50 PM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
>
> On Fri, 5 Aug 2016 10:51:40 -0400
> Assaf Gordon <assafgordon <at> gmail.com> wrote:
>
>> Hello Norihiro,
>>
>> On 08/05/2016 10:03 AM, Norihiro Tanaka wrote:
>> > We can speeds up sed by using dfa matcher brought from grep.  gawk users
>> > it, sed does not uses it yet.  It will speed up matching for typical
>> > cases.
>> [...]
>> > I believe that this patch can greatly improve performance of matching by
>> > sed, however I worry about the maintenance as updates for dfa is always
>> > done in grep.
>>
>> Nice improvement, thank you for the patch!

Thanks again.
I've revised it as follows and expect to push tomorrow:
  - remove the abort and comment from dfaerror -- should is not
necessary, given the _Noreturn attribute.
  - adjusted commit log and NEWS entry, also moving the "Improvements"
section to the top
  - sorted source file names in local.mk (they were not sorted before, either)
  - added the "make syntax-check"-required mention of sed/dfa.c in
po/POTFILES.in
[0001-sed-use-grep-s-DFA-matcher-to-speed-up-regular-expre.diff (text/plain, attachment)]

Information forwarded to bug-sed <at> gnu.org:
bug#24161; Package sed. (Mon, 08 Aug 2016 23:30:02 GMT) Full text and rfc822 format available.

Message #20 received at 24161 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Jim Meyering <jim <at> meyering.net>
Cc: Assaf Gordon <assafgordon <at> gmail.com>, 24161 <at> debbugs.gnu.org
Subject: Re: bug#24161: [PATCH 2/2] sed: speed up matching by reguler
 expression with dfa matcher
Date: Tue, 09 Aug 2016 08:29:42 +0900
On Sat, 6 Aug 2016 19:20:22 -0700
Jim Meyering <jim <at> meyering.net> wrote:

> Thanks again.
> I've revised it as follows and expect to push tomorrow:
>   - remove the abort and comment from dfaerror -- should is not
> necessary, given the _Noreturn attribute.
>   - adjusted commit log and NEWS entry, also moving the "Improvements"
> section to the top
>   - sorted source file names in local.mk (they were not sorted before, either)
>   - added the "make syntax-check"-required mention of sed/dfa.c in
> po/POTFILES.in

Thanks for adjusting.  I agree the all changes.





Reply sent to Jim Meyering <jim <at> meyering.net>:
You have taken responsibility. (Tue, 09 Aug 2016 00:54:01 GMT) Full text and rfc822 format available.

Notification sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
bug acknowledged by developer. (Tue, 09 Aug 2016 00:54:02 GMT) Full text and rfc822 format available.

Message #25 received at 24161-done <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 24161-done <at> debbugs.gnu.org, Assaf Gordon <assafgordon <at> gmail.com>
Subject: Re: bug#24161: [PATCH 2/2] sed: speed up matching by reguler
 expression with dfa matcher
Date: Mon, 8 Aug 2016 17:53:04 -0700
On Mon, Aug 8, 2016 at 4:29 PM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
>
> On Sat, 6 Aug 2016 19:20:22 -0700
> Jim Meyering <jim <at> meyering.net> wrote:
>
>> Thanks again.
>> I've revised it as follows and expect to push tomorrow:
>>   - remove the abort and comment from dfaerror -- should is not
>> necessary, given the _Noreturn attribute.
>>   - adjusted commit log and NEWS entry, also moving the "Improvements"
>> section to the top
>>   - sorted source file names in local.mk (they were not sorted before, either)
>>   - added the "make syntax-check"-required mention of sed/dfa.c in
>> po/POTFILES.in
>
> Thanks for adjusting.  I agree the all changes.

Pushed.




Information forwarded to bug-sed <at> gnu.org:
bug#24161; Package sed. (Tue, 09 Aug 2016 01:34:01 GMT) Full text and rfc822 format available.

Message #28 received at 24161-done <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 24161-done <at> debbugs.gnu.org, Assaf Gordon <assafgordon <at> gmail.com>
Subject: Re: bug#24161: [PATCH 2/2] sed: speed up matching by reguler
 expression with dfa matcher
Date: Mon, 8 Aug 2016 18:33:26 -0700
[Message part 1 (text/plain, inline)]
On Mon, Aug 8, 2016 at 5:53 PM, Jim Meyering <jim <at> meyering.net> wrote:
> On Mon, Aug 8, 2016 at 4:29 PM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
...
>> Thanks for adjusting.  I agree the all changes.
>
> Pushed.

Oops. I had not run ASAN-enabled tests with this patch.
Doing that exposed many test failures like this:

LC_ALL=C  ./sed/sed -f ./testsuite/madding.sed \
        < ./testsuite/madding.inp | LC_ALL=C tr -d \\r > madding.out
=================================================================
==8797==ERROR: AddressSanitizer: heap-buffer-overflow on address
0x622000019646 at pc 0x0000004e98ce bp 0x7ffee5d23130 sp
0x7ffee5d23128
READ of size 1 at 0x622000019646 thread T0
    #0 0x4e98cd in dfaexec_main sed/dfa.c:3078
    #1 0x4e98cd in dfaexec_sb sed/dfa.c:3242
    #2 0x4ece9f in dfaexec sed/dfa.c:3263
    #3 0x4f64ba in match_regex sed/regexp.c:270
    #4 0x4f46ac in do_subst sed/execute.c:1024
    #5 0x4f46ac in execute_program sed/execute.c:1510
    #6 0x4f5997 in process_files sed/execute.c:1678
    #7 0x4f6f64 in main sed/sed.c:363
    #8 0x7fe665f89730 in __libc_start_main (/lib64/libc.so.6+0x20730)
    #9 0x407118 in _start (/home/j/w/co/sed/sed/sed+0x407118)

which I am fixing with this patch:
[0001-sed-avoid-one-byte-heap-buffer-overrun.diff (text/plain, attachment)]

Information forwarded to bug-sed <at> gnu.org:
bug#24161; Package sed. (Tue, 09 Aug 2016 14:00:02 GMT) Full text and rfc822 format available.

Message #31 received at 24161-done <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Jim Meyering <jim <at> meyering.net>
Cc: 24161-done <at> debbugs.gnu.org, Assaf Gordon <assafgordon <at> gmail.com>
Subject: Re: bug#24161: [PATCH 2/2] sed: speed up matching by reguler
 expression with dfa matcher
Date: Tue, 09 Aug 2016 22:59:33 +0900
On Mon, 8 Aug 2016 18:33:26 -0700
Jim Meyering <jim <at> meyering.net> wrote:

> On Mon, Aug 8, 2016 at 5:53 PM, Jim Meyering <jim <at> meyering.net> wrote:
> > On Mon, Aug 8, 2016 at 4:29 PM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> ...
> >> Thanks for adjusting.  I agree the all changes.
> >
> > Pushed.
> 
> Oops. I had not run ASAN-enabled tests with this patch.
> Doing that exposed many test failures like this:
> 
> LC_ALL=C  ./sed/sed -f ./testsuite/madding.sed \
>         < ./testsuite/madding.inp | LC_ALL=C tr -d \\r > madding.out
> =================================================================
> ==8797==ERROR: AddressSanitizer: heap-buffer-overflow on address
> 0x622000019646 at pc 0x0000004e98ce bp 0x7ffee5d23130 sp
> 0x7ffee5d23128
> READ of size 1 at 0x622000019646 thread T0
>     #0 0x4e98cd in dfaexec_main sed/dfa.c:3078
>     #1 0x4e98cd in dfaexec_sb sed/dfa.c:3242
>     #2 0x4ece9f in dfaexec sed/dfa.c:3263
>     #3 0x4f64ba in match_regex sed/regexp.c:270
>     #4 0x4f46ac in do_subst sed/execute.c:1024
>     #5 0x4f46ac in execute_program sed/execute.c:1510
>     #6 0x4f5997 in process_files sed/execute.c:1678
>     #7 0x4f6f64 in main sed/sed.c:363
>     #8 0x7fe665f89730 in __libc_start_main (/lib64/libc.so.6+0x20730)
>     #9 0x407118 in _start (/home/j/w/co/sed/sed/sed+0x407118)
> 
> which I am fixing with this patch:

Special thanks for catching up abd fixing heap-buffer-overflow bug.





bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 07 Sep 2016 11:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 8 years and 283 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.