GNU bug report logs -
#24161
[PATCH 2/2] sed: speed up matching by reguler expression with dfa matcher
Previous Next
Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Date: Fri, 5 Aug 2016 14:05:01 UTC
Severity: normal
Tags: patch
Done: Jim Meyering <jim <at> meyering.net>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 24161 in the body.
You can then email your comments to 24161 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-sed <at> gnu.org
:
bug#24161
; Package
sed
.
(Fri, 05 Aug 2016 14:05:01 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
Norihiro Tanaka <noritnk <at> kcn.ne.jp>
:
New bug report received and forwarded. Copy sent to
bug-sed <at> gnu.org
.
(Fri, 05 Aug 2016 14:05:02 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Hi,
We can speeds up sed by using dfa matcher brought from grep. gawk users
it, sed does not uses it yet. It will speed up matching for typical
cases.
$ yes $(printf %040d 0) | head -1000000 >k
Before:
]$ time -p env LC_ALL=C sed/sed -ne /000000000k/p k
real 3.04
user 2.99
sys 0.03
$ time -p env LC_ALL=en_US.utf8 sed/sed -ne /000000000k/p k
real 3.04
user 2.90
sys 0.06
$ time -p env LC_ALL=ja_JP.eucjp sed/sed -ne /000000000k/p k
real 7.09
user 6.77
sys 0.31
After patching:
$ time -p env LC_ALL=C sed/sed -ne /000000000k/p k
real 0.29
user 0.15
sys 0.10
$ time -p env LC_ALL=en_US.utf8 sed/sed -ne /000000000k/p k
real 0.27
user 0.25
sys 0.02
$ time -p env LC_ALL=ja_JP.eucjp sed/sed -ne /000000000k/p k
real 0.33
user 0.29
sys 0.03
I believe that this patch can greatly improve performance of matching by
sed, however I worry about the maintenance as updates for dfa is always
done in grep.
Thanks,
Norihiro
[0002-sed-speed-up-matching-by-reguler-expression-with-dfa.patch (text/plain, attachment)]
Information forwarded
to
bug-sed <at> gnu.org
:
bug#24161
; Package
sed
.
(Fri, 05 Aug 2016 14:52:02 GMT)
Full text and
rfc822 format available.
Message #8 received at 24161 <at> debbugs.gnu.org (full text, mbox):
Hello Norihiro,
On 08/05/2016 10:03 AM, Norihiro Tanaka wrote:
> We can speeds up sed by using dfa matcher brought from grep. gawk users
> it, sed does not uses it yet. It will speed up matching for typical
> cases.
[...]
> I believe that this patch can greatly improve performance of matching by
> sed, however I worry about the maintenance as updates for dfa is always
> done in grep.
Nice improvement, thank you for the patch!
Keeping up with grep should not be a problem - Jim Meyering is the maintainer of both sed and grep (and I also try to keep up-to-date with grep's changes).
I wonder, if this code is used by multiple gnu projects, wouldn't it be better to include it in gnulib (instead of duplicating the code) ?
I also see from a cursory look that the test coverage is low (which is expected: sed currently use the native regex code and does not test specifically for it). I'm happy to try and add more tests, but it'll take a bit of time.
regards,
- assaf
Information forwarded
to
bug-sed <at> gnu.org
:
bug#24161
; Package
sed
.
(Fri, 05 Aug 2016 15:13:02 GMT)
Full text and
rfc822 format available.
Message #11 received at 24161 <at> debbugs.gnu.org (full text, mbox):
On Fri, Aug 5, 2016 at 7:51 AM, Assaf Gordon <assafgordon <at> gmail.com> wrote:
> Hello Norihiro,
>
> On 08/05/2016 10:03 AM, Norihiro Tanaka wrote:
>>
>> We can speeds up sed by using dfa matcher brought from grep. gawk users
>> it, sed does not uses it yet. It will speed up matching for typical
>> cases.
>
> [...]
>>
>> I believe that this patch can greatly improve performance of matching by
>> sed, however I worry about the maintenance as updates for dfa is always
>> done in grep.
>
> Nice improvement, thank you for the patch!
Nice one, indeed.
I hope to review it by this weekend.
> I wonder, if this code is used by multiple gnu projects, wouldn't it be
> better to include it in gnulib (instead of duplicating the code) ?
Yes. However, I'm happy to move dfa.[ch] into gnulib after this patch lands.
> I also see from a cursory look that the test coverage is low (which is
> expected: sed currently use the native regex code and does not test
> specifically for it). I'm happy to try and add more tests, but it'll take a
> bit of time.
Eventually, we'll want dfa's tests to reside in gnulib.
Information forwarded
to
bug-sed <at> gnu.org
:
bug#24161
; Package
sed
.
(Sun, 07 Aug 2016 00:51:01 GMT)
Full text and
rfc822 format available.
Message #14 received at 24161 <at> debbugs.gnu.org (full text, mbox):
On Fri, 5 Aug 2016 10:51:40 -0400
Assaf Gordon <assafgordon <at> gmail.com> wrote:
> Hello Norihiro,
>
> On 08/05/2016 10:03 AM, Norihiro Tanaka wrote:
> > We can speeds up sed by using dfa matcher brought from grep. gawk users
> > it, sed does not uses it yet. It will speed up matching for typical
> > cases.
> [...]
> > I believe that this patch can greatly improve performance of matching by
> > sed, however I worry about the maintenance as updates for dfa is always
> > done in grep.
>
> Nice improvement, thank you for the patch!
>
> Keeping up with grep should not be a problem - Jim Meyering is the maintainer of both sed and grep (and I also try to keep up-to-date with grep's changes).
>
> I wonder, if this code is used by multiple gnu projects, wouldn't it be better to include it in gnulib (instead of duplicating the code) ?
>
> I also see from a cursory look that the test coverage is low (which is expected: sed currently use the native regex code and does not test specifically for it). I'm happy to try and add more tests, but it'll take a bit of time.
>
> regards,
> - assaf
Hi Assaf and Jim,
Thanks for considering the patch. I also think that it is better to
include in gnulib.
Thanks,
Norihiro
Information forwarded
to
bug-sed <at> gnu.org
:
bug#24161
; Package
sed
.
(Sun, 07 Aug 2016 02:21:01 GMT)
Full text and
rfc822 format available.
Message #17 received at 24161 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On Sat, Aug 6, 2016 at 5:50 PM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
>
> On Fri, 5 Aug 2016 10:51:40 -0400
> Assaf Gordon <assafgordon <at> gmail.com> wrote:
>
>> Hello Norihiro,
>>
>> On 08/05/2016 10:03 AM, Norihiro Tanaka wrote:
>> > We can speeds up sed by using dfa matcher brought from grep. gawk users
>> > it, sed does not uses it yet. It will speed up matching for typical
>> > cases.
>> [...]
>> > I believe that this patch can greatly improve performance of matching by
>> > sed, however I worry about the maintenance as updates for dfa is always
>> > done in grep.
>>
>> Nice improvement, thank you for the patch!
Thanks again.
I've revised it as follows and expect to push tomorrow:
- remove the abort and comment from dfaerror -- should is not
necessary, given the _Noreturn attribute.
- adjusted commit log and NEWS entry, also moving the "Improvements"
section to the top
- sorted source file names in local.mk (they were not sorted before, either)
- added the "make syntax-check"-required mention of sed/dfa.c in
po/POTFILES.in
[0001-sed-use-grep-s-DFA-matcher-to-speed-up-regular-expre.diff (text/plain, attachment)]
Information forwarded
to
bug-sed <at> gnu.org
:
bug#24161
; Package
sed
.
(Mon, 08 Aug 2016 23:30:02 GMT)
Full text and
rfc822 format available.
Message #20 received at 24161 <at> debbugs.gnu.org (full text, mbox):
On Sat, 6 Aug 2016 19:20:22 -0700
Jim Meyering <jim <at> meyering.net> wrote:
> Thanks again.
> I've revised it as follows and expect to push tomorrow:
> - remove the abort and comment from dfaerror -- should is not
> necessary, given the _Noreturn attribute.
> - adjusted commit log and NEWS entry, also moving the "Improvements"
> section to the top
> - sorted source file names in local.mk (they were not sorted before, either)
> - added the "make syntax-check"-required mention of sed/dfa.c in
> po/POTFILES.in
Thanks for adjusting. I agree the all changes.
Reply sent
to
Jim Meyering <jim <at> meyering.net>
:
You have taken responsibility.
(Tue, 09 Aug 2016 00:54:01 GMT)
Full text and
rfc822 format available.
Notification sent
to
Norihiro Tanaka <noritnk <at> kcn.ne.jp>
:
bug acknowledged by developer.
(Tue, 09 Aug 2016 00:54:02 GMT)
Full text and
rfc822 format available.
Message #25 received at 24161-done <at> debbugs.gnu.org (full text, mbox):
On Mon, Aug 8, 2016 at 4:29 PM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
>
> On Sat, 6 Aug 2016 19:20:22 -0700
> Jim Meyering <jim <at> meyering.net> wrote:
>
>> Thanks again.
>> I've revised it as follows and expect to push tomorrow:
>> - remove the abort and comment from dfaerror -- should is not
>> necessary, given the _Noreturn attribute.
>> - adjusted commit log and NEWS entry, also moving the "Improvements"
>> section to the top
>> - sorted source file names in local.mk (they were not sorted before, either)
>> - added the "make syntax-check"-required mention of sed/dfa.c in
>> po/POTFILES.in
>
> Thanks for adjusting. I agree the all changes.
Pushed.
Information forwarded
to
bug-sed <at> gnu.org
:
bug#24161
; Package
sed
.
(Tue, 09 Aug 2016 01:34:01 GMT)
Full text and
rfc822 format available.
Message #28 received at 24161-done <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On Mon, Aug 8, 2016 at 5:53 PM, Jim Meyering <jim <at> meyering.net> wrote:
> On Mon, Aug 8, 2016 at 4:29 PM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
...
>> Thanks for adjusting. I agree the all changes.
>
> Pushed.
Oops. I had not run ASAN-enabled tests with this patch.
Doing that exposed many test failures like this:
LC_ALL=C ./sed/sed -f ./testsuite/madding.sed \
< ./testsuite/madding.inp | LC_ALL=C tr -d \\r > madding.out
=================================================================
==8797==ERROR: AddressSanitizer: heap-buffer-overflow on address
0x622000019646 at pc 0x0000004e98ce bp 0x7ffee5d23130 sp
0x7ffee5d23128
READ of size 1 at 0x622000019646 thread T0
#0 0x4e98cd in dfaexec_main sed/dfa.c:3078
#1 0x4e98cd in dfaexec_sb sed/dfa.c:3242
#2 0x4ece9f in dfaexec sed/dfa.c:3263
#3 0x4f64ba in match_regex sed/regexp.c:270
#4 0x4f46ac in do_subst sed/execute.c:1024
#5 0x4f46ac in execute_program sed/execute.c:1510
#6 0x4f5997 in process_files sed/execute.c:1678
#7 0x4f6f64 in main sed/sed.c:363
#8 0x7fe665f89730 in __libc_start_main (/lib64/libc.so.6+0x20730)
#9 0x407118 in _start (/home/j/w/co/sed/sed/sed+0x407118)
which I am fixing with this patch:
[0001-sed-avoid-one-byte-heap-buffer-overrun.diff (text/plain, attachment)]
Information forwarded
to
bug-sed <at> gnu.org
:
bug#24161
; Package
sed
.
(Tue, 09 Aug 2016 14:00:02 GMT)
Full text and
rfc822 format available.
Message #31 received at 24161-done <at> debbugs.gnu.org (full text, mbox):
On Mon, 8 Aug 2016 18:33:26 -0700
Jim Meyering <jim <at> meyering.net> wrote:
> On Mon, Aug 8, 2016 at 5:53 PM, Jim Meyering <jim <at> meyering.net> wrote:
> > On Mon, Aug 8, 2016 at 4:29 PM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> ...
> >> Thanks for adjusting. I agree the all changes.
> >
> > Pushed.
>
> Oops. I had not run ASAN-enabled tests with this patch.
> Doing that exposed many test failures like this:
>
> LC_ALL=C ./sed/sed -f ./testsuite/madding.sed \
> < ./testsuite/madding.inp | LC_ALL=C tr -d \\r > madding.out
> =================================================================
> ==8797==ERROR: AddressSanitizer: heap-buffer-overflow on address
> 0x622000019646 at pc 0x0000004e98ce bp 0x7ffee5d23130 sp
> 0x7ffee5d23128
> READ of size 1 at 0x622000019646 thread T0
> #0 0x4e98cd in dfaexec_main sed/dfa.c:3078
> #1 0x4e98cd in dfaexec_sb sed/dfa.c:3242
> #2 0x4ece9f in dfaexec sed/dfa.c:3263
> #3 0x4f64ba in match_regex sed/regexp.c:270
> #4 0x4f46ac in do_subst sed/execute.c:1024
> #5 0x4f46ac in execute_program sed/execute.c:1510
> #6 0x4f5997 in process_files sed/execute.c:1678
> #7 0x4f6f64 in main sed/sed.c:363
> #8 0x7fe665f89730 in __libc_start_main (/lib64/libc.so.6+0x20730)
> #9 0x407118 in _start (/home/j/w/co/sed/sed/sed+0x407118)
>
> which I am fixing with this patch:
Special thanks for catching up abd fixing heap-buffer-overflow bug.
bug archived.
Request was from
Debbugs Internal Request <help-debbugs <at> gnu.org>
to
internal_control <at> debbugs.gnu.org
.
(Wed, 07 Sep 2016 11:24:04 GMT)
Full text and
rfc822 format available.
This bug report was last modified 8 years and 283 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.