GNU bug report logs - #18777
[PATCH] dfa: improvement for checking of multibyte character boundary


Package: grep

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Mon, 20 Oct 2014 15:05:01 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.


From: arnold <at> skeeve.com
To: noritnk <at> kcn.ne.jp, eblake <at> redhat.com
Cc: 18777 <at> debbugs.gnu.org
Subject: bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary
Date: Tue, 21 Oct 2014 00:23:07 -0600
Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:

> Eric Blake <eblake <at> redhat.com> wrote:
> > Is it worth extending your optimization to all five of the
> > POSIX-guaranteed single byte characters?
>
> Thanks, but I don't want to do it immediately.  DFA already treats
> newline as a single-byte character, but not the others yet.  So we
> may need to make many changes to handle invalid locales and sequences
> that do not conform to the rule.  If we omitted that, the set of
> locales to which DFA can be applied might have to be restricted.
> Therefore, it should be done carefully.

I would think adding a check for '\r' would be safe and would help
too; given that on Windows systems '\r' generally occurs just as
frequently as '\n', it should give a nice speedup for gawk on those
systems.
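
To illustrate the idea being discussed, here is a minimal sketch, not
the code from the patch or from dfa.c.  It assumes an encoding in which
the bytes '\n' and '\r' never occur inside a multibyte character, so
either one marks a known character boundary.  The names is_sync_byte
and last_known_boundary are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not taken from dfa.c: true for bytes that are
   ASSUMED never to occur inside a multibyte character.  */
static bool
is_sync_byte (unsigned char c)
{
  return c == '\n' || c == '\r';
}

/* Return the offset of the nearest known character boundary at or
   before POS in BUF: one past the last sync byte that precedes POS,
   or 0 if there is none.  A caller could start multibyte validation
   from this offset instead of from the start of the buffer.  */
static size_t
last_known_boundary (const char *buf, size_t pos)
{
  while (pos > 0)
    {
      if (is_sync_byte ((unsigned char) buf[pos - 1]))
        return pos;
      pos--;
    }
  return 0;
}
```

With CRLF line endings, the backward scan stops at whichever of the two
bytes it meets first, so accepting '\r' as well as '\n' costs only one
extra comparison per byte scanned.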

The other characters that Eric cited seem like less of an issue to me.

Thanks,

Arnold






