GNU bug report logs - #18777
[PATCH] dfa: improvement for checking of multibyte character boundary


Package: grep

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Mon, 20 Oct 2014 15:05:01 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.


From: Eric Blake <eblake <at> redhat.com>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, 18777 <at> debbugs.gnu.org
Subject: bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary
Date: Mon, 20 Oct 2014 10:07:20 -0600
On 10/20/2014 09:04 AM, Norihiro Tanaka wrote:
> This patch improves performance for input strings that don't match
> even the first part of a pattern.  Although it has little effect on
> grep, which uses a superset of the DFA, gawk speeds up by about 40%.
> 

> 
> When a newline is found, we can skip the check for a multibyte character
> boundary before that character, as we assume newline is a single-byte
> character.

POSIX requires that NUL, slash, dot, newline, and carriage return all be
single bytes that cannot occur inside a multibyte character (because
they have special meaning to file name resolution and/or terminal
interaction); it added this requirement fairly recently, but only after
confirming that common existing locales satisfy this constraint.  (The
same is not true for almost any other character; even though POSIX
requires that a-z, A-Z, and 0-9 be single bytes, it does not forbid
those bytes from also appearing embedded within multibyte
characters.)  Is it worth extending your optimization to all five of the
POSIX-guaranteed single-byte characters?
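
To make the suggestion concrete, here is a rough sketch of what such a check
might look like.  Both function names are hypothetical and are not taken from
dfa.c; the sketch only assumes the property described above, namely that these
five bytes never appear inside a multibyte character, so the position
immediately after any of them is a known character boundary.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of dfa.c: return true if byte C is one
   of the five characters assumed (per the discussion above) never to
   occur inside a multibyte character.  */
static bool
always_single_byte (unsigned char c)
{
  return c == '\0' || c == '/' || c == '.' || c == '\n' || c == '\r';
}

/* Hypothetical helper: scan backward from P toward BUF and return the
   nearest position at or before P that is known to be a character
   boundary without decoding anything.  That is either the position
   just after one of the five bytes above, or BUF itself.  */
static const char *
nearest_safe_boundary (const char *buf, const char *p)
{
  while (buf < p && !always_single_byte ((unsigned char) p[-1]))
    p--;
  return p;
}
```

A matcher could resume multibyte decoding from the returned position rather
than from the start of the buffer; whether that pays off beyond the newline
case would need measuring.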

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


This bug report was last modified 9 years and 76 days ago.
