GNU bug report logs - #23932
dfa: use algorithm for single byte character to any single byte character in input text always

Previous Next

Package: grep;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Sun, 10 Jul 2016 09:53:01 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Full log


View this message in rfc822 format

From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Subject: bug#23932: closed (Re: bug#23932: dfa: use algorithm for single
 byte character to any single byte character in input text always)
Date: Thu, 01 Sep 2016 18:51:02 +0000
[Message part 1 (text/plain, inline)]
Your bug report

#23932: dfa: use algorithm for single byte character to any single byte character in input text always

which was filed against the grep package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 23932 <at> debbugs.gnu.org.

-- 
23932: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=23932
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 23932-done <at> debbugs.gnu.org
Subject: Re: bug#23932: dfa: use algorithm for single byte character to any
 single byte character in input text always
Date: Thu, 1 Sep 2016 11:49:59 -0700
[Message part 3 (text/plain, inline)]
Thanks for that set of patches too. I rebased it and tweaked NEWS and installed 
the resulting patch set (attached) into Savannah master.
[0001-dfa-use-single-byte-algorithm-even-in-non-UTF-8.patch (text/x-diff, attachment)]
[0002-dfa-avoid-invalid-character-matching-period.patch (text/x-diff, attachment)]
[0003-dfa-document-previous-change.patch (text/x-diff, attachment)]
[0004-dfa-remove-separation-by-context-in-transition-in-no.patch (text/x-diff, attachment)]
[Message part 8 (message/rfc822, inline)]
From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: <bug-grep <at> gnu.org>
Subject: dfa: use algorithm for single byte character to any single byte
 character in input text always
Date: Sun, 10 Jul 2016 18:51:43 +0900
[Message part 9 (text/plain, inline)]
In multibyte locales, if a pattern start with period expression,
matching is still slow, as transition table is built at run time,
even when next character is single byte in input text.

This patch changes it into as use algorithm for single byte character to
any single byte character in input text always.  If transition table has
been built already and a next character in input text is single byte,
transit to next state by reference of only pre-built transition table,
even if from a state including ANYCHAR.

$ yes "$(printf 'a%038db\n' 0)" | head -1000000 >in
$ env LC_ALL=C gcc -v
Reading specs from /usr/local/lib/gcc/x86_64-pc-linux-gnu/4.4.7/specs
Target: x86_64-pc-linux-gnu
Configured with: ./configure --with-as=/usr/local/bin/as --with-ld=/usr/local/bin/ld --with-system-zlib --enable-__cxa_atexit
Thread model: posix
gcc version 4.4.7 (GCC)

patch#21486 is required before this patch.  grep will speed up by this
patch additionaly.

[grep-2.25]
$ time -p env LC_ALL=ja_JP.eucjp grep .a.b in
real 4.78
user 4.42
sys 0.16
$ time -p env LC_ALL=ja_JP.eucjp grep '.\{41\}' in
real 46.23
user 43.98
sys 0.21

[after patch#21486]
$ time -p env LC_ALL=ja_JP.eucjp src/grep .a.b in
real 1.26
user 1.08
sys 0.08
$ time -p env LC_ALL=ja_JP.eucjp src/grep '.\{41\}' in
real 1.14
user 1.00
sys 0.10

[after this patch]
$ time -p env LC_ALL=ja_JP.eucjp src/grep .a.b in
real 0.47
user 0.36
sys 0.07
$ time -p env LC_ALL=ja_JP.eucjp src/grep '.\{41\}' in
real 0.24
user 0.18
sys 0.05

[locale C (ref.)]
$ time -p env LC_ALL=C src/grep .a.b in
real 0.23
user 0.11
sys 0.09
$ time -p env LC_ALL=C src/grep '.\{41\}' in
real 0.22
user 0.13
sys 0.06
[0001-dfa-use-algorithm-for-single-byte-character-to-any-s.patch (text/plain, attachment)]

This bug report was last modified 8 years and 265 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.