GNU bug report logs - #18777
[PATCH] dfa: improvement for checking of multibyte character boundary
Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Date: Mon, 20 Oct 2014 15:05:01 UTC
Severity: normal
Tags: patch
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
On 10/20/2014 09:04 AM, Norihiro Tanaka wrote:
> This patch improves performance for input strings that don't match
> even the first part of a pattern. Although it has little effect on
> grep, which uses a superset of the DFA, gawk speeds up by about 40%.
>
> When we find a newline, we can skip the multibyte character boundary
> check before that character, because we assume newline is a
> single-byte character.
POSIX requires that NUL, slash, dot, newline, and carriage return all be
single bytes that cannot occur inside a multibyte character (because
they have special meaning to file name resolution and/or terminal
interaction); it added this requirement fairly recently, but only after
confirming that common existing locales satisfy this constraint. (The
same is not true for almost any other character: even though POSIX
requires that a-z, A-Z, and 0-9 be single bytes, it does not forbid
those bytes from also appearing inside multibyte characters.) Is it
worth extending your optimization to all five of the POSIX-guaranteed
single-byte characters?
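
To make the question concrete, here is a minimal sketch of what the
extended check might look like. It is not taken from dfa.c or from the
patch; the names is_safe_byte, char_len_fn, toy_char_len, and
on_char_boundary are hypothetical, and the toy encoding (any byte >=
0x80 starts a two-byte character whose second byte may be any value)
only stands in for a real locale where trailing bytes can look like
ASCII letters. The idea is that a byte from the guaranteed set is
always a boundary by itself, and also serves as a resynchronization
point for checking later positions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Bytes that, per the discussion above, are assumed to be single-byte
   characters that never occur inside a multibyte character. */
static bool is_safe_byte(unsigned char c)
{
    return c == '\0' || c == '/' || c == '.' || c == '\n' || c == '\r';
}

/* Hypothetical callback: length in bytes of the character that starts
   at s, with n bytes remaining.  Must return at least 1. */
typedef size_t (*char_len_fn)(const char *s, size_t n);

/* Toy encoding used only for illustration: a byte >= 0x80 starts a
   two-byte character; everything else is a single byte. */
static size_t toy_char_len(const char *s, size_t n)
{
    return ((unsigned char) s[0] >= 0x80 && n >= 2) ? 2 : 1;
}

/* Return true if buf[pos] is the first byte of a character. */
static bool on_char_boundary(const char *buf, size_t len, size_t pos,
                             char_len_fn char_len)
{
    assert(pos < len);

    /* Fast path: a safe byte is always a character by itself. */
    if (is_safe_byte((unsigned char) buf[pos]))
        return true;

    /* Back up to just after the nearest preceding safe byte (or to the
       start of the buffer); that position is a known boundary. */
    size_t start = pos;
    while (start > 0 && !is_safe_byte((unsigned char) buf[start - 1]))
        start--;

    /* Walk forward one character at a time from the known boundary. */
    size_t i = start;
    while (i < pos)
        i += char_len(buf + i, len - i);

    return i == pos;
}
```

With the five-byte buffer 'a', 0x81, 'b', '\n', 'c' and toy_char_len,
position 2 (the 'b') is reported as not on a boundary because it is the
trailing byte of the character starting at 0x81, while positions 3 and
4 are accepted without scanning back past the newline.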
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org