GNU bug report logs - #23932
dfa: use algorithm for single byte character to any single byte character in input text always

Previous Next

Package: grep;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Sun, 10 Jul 2016 09:53:01 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 23932 in the body.
You can then email your comments to 23932 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-grep <at> gnu.org:
bug#23932; Package grep. (Sun, 10 Jul 2016 09:53:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sun, 10 Jul 2016 09:53:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: <bug-grep <at> gnu.org>
Subject: dfa: use algorithm for single byte character to any single byte
 character in input text always
Date: Sun, 10 Jul 2016 18:51:43 +0900
[Message part 1 (text/plain, inline)]
In multibyte locales, if a pattern start with period expression,
matching is still slow, as transition table is built at run time,
even when next character is single byte in input text.

This patch changes it into as use algorithm for single byte character to
any single byte character in input text always.  If transition table has
been built already and a next character in input text is single byte,
transit to next state by reference of only pre-built transition table,
even if from a state including ANYCHAR.

$ yes "$(printf 'a%038db\n' 0)" | head -1000000 >in
$ env LC_ALL=C gcc -v
Reading specs from /usr/local/lib/gcc/x86_64-pc-linux-gnu/4.4.7/specs
Target: x86_64-pc-linux-gnu
Configured with: ./configure --with-as=/usr/local/bin/as --with-ld=/usr/local/bin/ld --with-system-zlib --enable-__cxa_atexit
Thread model: posix
gcc version 4.4.7 (GCC)

patch#21486 is required before this patch.  grep will speed up by this
patch additionaly.

[grep-2.25]
$ time -p env LC_ALL=ja_JP.eucjp grep .a.b in
real 4.78
user 4.42
sys 0.16
$ time -p env LC_ALL=ja_JP.eucjp grep '.\{41\}' in
real 46.23
user 43.98
sys 0.21

[after patch#21486]
$ time -p env LC_ALL=ja_JP.eucjp src/grep .a.b in
real 1.26
user 1.08
sys 0.08
$ time -p env LC_ALL=ja_JP.eucjp src/grep '.\{41\}' in
real 1.14
user 1.00
sys 0.10

[after this patch]
$ time -p env LC_ALL=ja_JP.eucjp src/grep .a.b in
real 0.47
user 0.36
sys 0.07
$ time -p env LC_ALL=ja_JP.eucjp src/grep '.\{41\}' in
real 0.24
user 0.18
sys 0.05

[locale C (ref.)]
$ time -p env LC_ALL=C src/grep .a.b in
real 0.23
user 0.11
sys 0.09
$ time -p env LC_ALL=C src/grep '.\{41\}' in
real 0.22
user 0.13
sys 0.06
[0001-dfa-use-algorithm-for-single-byte-character-to-any-s.patch (text/plain, attachment)]

Added tag(s) patch. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Tue, 16 Aug 2016 07:57:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-grep <at> gnu.org:
bug#23932; Package grep. (Tue, 16 Aug 2016 14:36:02 GMT) Full text and rfc822 format available.

Message #10 received at 23932 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: 23932 <at> debbugs.gnu.org
Subject: Re: bug#23932: dfa: use algorithm for single byte character to any
 single byte character in input text always
Date: Tue, 16 Aug 2016 23:35:22 +0900
[Message part 1 (text/plain, inline)]
On Sun, 10 Jul 2016 18:51:43 +0900
Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:

> In multibyte locales, if a pattern start with period expression,
> matching is still slow, as transition table is built at run time,
> even when next character is single byte in input text.
> 
> This patch changes it into as use algorithm for single byte character to
> any single byte character in input text always.  If transition table has
> been built already and a next character in input text is single byte,
> transit to next state by reference of only pre-built transition table,
> even if from a state including ANYCHAR.
> 
> $ yes "$(printf 'a%038db\n' 0)" | head -1000000 >in
> $ env LC_ALL=C gcc -v
> Reading specs from /usr/local/lib/gcc/x86_64-pc-linux-gnu/4.4.7/specs
> Target: x86_64-pc-linux-gnu
> Configured with: ./configure --with-as=/usr/local/bin/as --with-ld=/usr/local/bin/ld --with-system-zlib --enable-__cxa_atexit
> Thread model: posix
> gcc version 4.4.7 (GCC)
> 
> patch#21486 is required before this patch.  grep will speed up by this
> patch additionaly.

I updated the patch due to change in bug#21486, and added a patch
including a minor change.
[0001-dfa-use-algorithm-for-single-byte-character-to-any-s.patch (text/plain, attachment)]
[0002-dfa-avoid-invalid-character-matches-period.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#23932; Package grep. (Wed, 17 Aug 2016 13:06:02 GMT) Full text and rfc822 format available.

Message #13 received at 23932 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 23932 <at> debbugs.gnu.org
Subject: Re: bug#23932: dfa: use algorithm for single byte character to any
 single byte character in input text always
Date: Wed, 17 Aug 2016 22:05:09 +0900
[Message part 1 (text/plain, inline)]
On Tue, 16 Aug 2016 23:35:22 +0900
Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:

> I updated the patch due to change in bug#21486, and added a patch
> including a minor change.

I wrote third patch.  After first patch, we do not have to separate next
state by context, transit_state() never treats a eol-byte as current
context.  So remove it.
[0003-dfa-remove-separation-by-context-in-transition-in-no.patch (text/plain, attachment)]

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Thu, 01 Sep 2016 18:51:01 GMT) Full text and rfc822 format available.

Notification sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
bug acknowledged by developer. (Thu, 01 Sep 2016 18:51:02 GMT) Full text and rfc822 format available.

Message #18 received at 23932-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 23932-done <at> debbugs.gnu.org
Subject: Re: bug#23932: dfa: use algorithm for single byte character to any
 single byte character in input text always
Date: Thu, 1 Sep 2016 11:49:59 -0700
[Message part 1 (text/plain, inline)]
Thanks for that set of patches too. I rebased it and tweaked NEWS and installed 
the resulting patch set (attached) into Savannah master.
[0001-dfa-use-single-byte-algorithm-even-in-non-UTF-8.patch (text/x-diff, attachment)]
[0002-dfa-avoid-invalid-character-matching-period.patch (text/x-diff, attachment)]
[0003-dfa-document-previous-change.patch (text/x-diff, attachment)]
[0004-dfa-remove-separation-by-context-in-transition-in-no.patch (text/x-diff, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#23932; Package grep. (Fri, 02 Sep 2016 03:27:02 GMT) Full text and rfc822 format available.

Message #21 received at 23932 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: 23932 <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>, 
 Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Subject: Re: bug#23932: dfa: use algorithm for single byte character to any
 single byte character in input text always
Date: Thu, 1 Sep 2016 20:26:22 -0700
On Thu, Sep 1, 2016 at 11:49 AM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Thanks for that set of patches too. I rebased it and tweaked NEWS and
> installed the resulting patch set (attached) into Savannah master.

Nice. Thanks to both of you for all that work.




Information forwarded to bug-grep <at> gnu.org:
bug#23932; Package grep. (Sat, 03 Sep 2016 01:28:02 GMT) Full text and rfc822 format available.

Message #24 received at 23932-done <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 23932-done <at> debbugs.gnu.org
Subject: Re: bug#23932: dfa: use algorithm for single byte character to any
 single byte character in input text always
Date: Sat, 03 Sep 2016 10:27:40 +0900
[Message part 1 (text/plain, inline)]
On Thu, 1 Sep 2016 11:49:59 -0700
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> Thanks for that set of patches too. I rebased it and tweaked NEWS and installed the resulting patch set (attached) into Savannah master.

Thanks for reviewing and installing.  BTW, I seem that you lost a part
of my proposition on rebase.  If it is not intentional, would you review
the part again?
[0001-dfa-additional-change-for-use-single-byte-algorithm-.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#23932; Package grep. (Sat, 03 Sep 2016 03:01:02 GMT) Full text and rfc822 format available.

Message #27 received at 23932-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 23932-done <at> debbugs.gnu.org
Subject: Re: bug#23932: dfa: use algorithm for single byte character to any
 single byte character in input text always
Date: Fri, 2 Sep 2016 20:00:12 -0700
Norihiro Tanaka wrote:

> I seem that you lost a part
> of my proposition on rebase.  If it is not intentional, would you review
> the part again?

Thanks for catching that; it was not intentional. I installed your patch.





Information forwarded to bug-grep <at> gnu.org:
bug#23932; Package grep. (Sat, 03 Sep 2016 03:06:01 GMT) Full text and rfc822 format available.

Message #30 received at 23932-done <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 23932-done <at> debbugs.gnu.org
Subject: Re: bug#23932: dfa: use algorithm for single byte character to any
 single byte character in input text always
Date: Sat, 03 Sep 2016 12:05:35 +0900
On Fri, 2 Sep 2016 20:00:12 -0700
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> Norihiro Tanaka wrote:
> 
> > I seem that you lost a part
> > of my proposition on rebase.  If it is not intentional, would you review
> > the part again?
> 
> Thanks for catching that; it was not intentional. I installed your patch.

Thanks for catching up.





bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 01 Oct 2016 11:24:03 GMT) Full text and rfc822 format available.

This bug report was last modified 8 years and 264 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.