GNU bug report logs - #18777
[PATCH] dfa: improvement for checking of multibyte character boundary

Package: grep

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Mon, 20 Oct 2014 15:05:01 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: arnold <at> skeeve.com
Cc: eblake <at> redhat.com, 18777 <at> debbugs.gnu.org
Subject: bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary
Date: Thu, 23 Oct 2014 00:28:35 +0900

arnold <at> skeeve.com wrote:
> Gawk does not remove CR in advance, unless someone specifically
> set RS = "\r\n", in which case the full regex matcher is used
> to first find \r\n in the raw input buffer.

Thanks, I also confirmed this in the Gawk source code.

> So for gawk, adding a check for (c == eolbyte || c == '\r')
> should produce more speedup on Windows.
> 
> (Hmm, on Windows the default is probably text mode which causes
> the library/OS to hide the \r anyway. Harumph.  But if binary mode
> was requested then it could still make a difference.)

I think it's better to build a KWset than to rely on checking for '\r'
in the DFA's non-UTF-8 multibyte mode.

Furthermore, even if we add a check for '\r' to the DFA, I don't think we
can use it to speed things up on Windows, because the DFA can't correctly
locate the position of a match unless the pattern is a fixed string.
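
As a rough illustration of the check quoted above, here is a minimal C
sketch. The helper names is_sync_byte and find_sync_point are hypothetical
and do not come from dfa.c or Gawk. The sketch assumes that neither eolbyte
nor '\r' can occur inside a multibyte character in the encoding in use,
which has to be verified per encoding.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not from dfa.c): true if byte C is treated as a
   point where no multibyte character can be in progress.  This is the
   (c == eolbyte || c == '\r') test from the quoted message.  */
static bool
is_sync_byte (unsigned char c, unsigned char eolbyte)
{
  return c == eolbyte || c == '\r';
}

/* Hypothetical helper: scan backward from P toward BEGIN and return a
   pointer just past the nearest preceding sync byte, or BEGIN if there
   is none.  A caller could restart multibyte boundary checking from the
   returned position instead of from BEGIN.  */
static const char *
find_sync_point (const char *begin, const char *p, unsigned char eolbyte)
{
  while (begin < p && !is_sync_byte ((unsigned char) p[-1], eolbyte))
    p--;
  return p;
}
```

With a buffer such as "ab\r\ncd" and eolbyte '\n', a scan starting at the
end of the buffer stops just after the '\n'. Accepting '\r' as well only
helps when a '\r' appears without a following eolbyte nearby, and it does
not address the match-position problem described above.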

This bug report was last modified 9 years and 74 days ago.
