GNU bug report logs -
#18777
[PATCH] dfa: improvement for checking of multibyte character boundary
Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Date: Mon, 20 Oct 2014 15:05:01 UTC
Severity: normal
Tags: patch
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
Message #23 received at 18777 <at> debbugs.gnu.org (full text, mbox):
Hi.
Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> arnold <at> skeeve.com wrote:
> > I would think adding a check for '\r' would be safe and would help
> > too; given that on Windows systems '\r' generally occurs just as
> > frequently as '\n', it should give a nice speedup for gawk on those
> > systems.
>
> As I understand it, DFA and regex don't support multi-byte eolbytes
> such as CR-LF, so I can't see where we could use the change. Grep
> converts Windows text to Unix text by removing CR in advance.
Gawk does not remove CR in advance, unless someone specifically
sets RS = "\r\n", in which case the full regex matcher is used
to first find \r\n in the raw input buffer.
So for gawk, adding a check for (c == eolbyte || c == '\r')
should produce more speedup on Windows.
(Hmm, on Windows the default is probably text mode, which causes
the library/OS to hide the \r anyway. Harumph. But if binary mode
was requested then it could still make a difference.)
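
A minimal sketch of the kind of check discussed above, assuming it is
used to back up to a byte that is known to be a complete single-byte
character. The names is_sync_byte and sync_start are hypothetical and
are not taken from dfa.c or gawk; only the condition
(c == eolbyte || c == '\r') comes from this message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from dfa.c): nonzero if C can be treated
   as a known character boundary, i.e. it is the end-of-line byte or
   a carriage return.  */
static int
is_sync_byte (unsigned char c, unsigned char eolbyte)
{
  return c == eolbyte || c == '\r';
}

/* Hypothetical helper: scan backward from POS and return the index
   just past the nearest sync byte before POS, or 0 if there is none.
   With '\r' accepted as well, the backward scan can stop earlier on
   input that contains CRs but few eolbytes.  */
static size_t
sync_start (const char *buf, size_t pos, unsigned char eolbyte)
{
  assert (buf != NULL);
  while (pos > 0 && !is_sync_byte ((unsigned char) buf[pos - 1], eolbyte))
    pos--;
  return pos;
}
```

For example, with eolbyte set to NUL, sync_start ("ab\rcd", 5, '\0')
stops just after the CR and returns 3, where a check for eolbyte alone
would have scanned back to the start of the buffer.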
> BTW, although I say `newline', note that strictly it's `eolbyte'
> which mayn't be either LF or NUL.
Understood and agreed.
Adding a check for \r isn't a big deal in any case, but of the 5
characters Erik mentioned originally, that is the only one where I
see a potential for a check to really make a difference.
Thanks!
Arnold
This bug report was last modified 9 years and 76 days ago.