GNU bug report logs - #18777
[PATCH] dfa: improvement for checking of multibyte character boundary
Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Date: Mon, 20 Oct 2014 15:05:01 UTC
Severity: normal
Tags: patch
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
On Mon, 15 Dec 2014 09:43:54 -0800
Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Can't we improve this when using_utf8 () is true? In that case, every
> ASCII character is always single byte. Also, the bytes 0xc0, 0xc1,
> and 0xf5 through 0xff can be added to the table: they are not
> single-byte characters but they are always encoding errors so they will
> be a character boundary as far as skip_remains_mb is concerned. This
> suggests that the table 'always_single_byte' should be renamed to
> something like 'always_character_boundary'.
>
> > wint_t wc = WEOF;
> > + if (always_single_byte[*p])
> > + return p;
Thanks for the review and suggestion. If using_utf8 () is true, we can
set always_character_boundary to true for every byte except 0x80-0xbf
(the UTF-8 trailing bytes).
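
As a minimal sketch of that idea (not the attached patch itself), the
table could be filled as below when the locale is UTF-8. The array name
follows the renaming suggested in the review; the function name
init_boundary_table_utf8 and the NBYTES macro are made up for
illustration and are not taken from dfa.c.

```c
#include <limits.h>
#include <stdbool.h>

/* Number of distinct byte values.  NBYTES is a hypothetical name.  */
#define NBYTES (UCHAR_MAX + 1)

/* True for byte values that can never be in the middle of a character.  */
static bool always_character_boundary[NBYTES];

/* Hypothetical helper: fill the table for a UTF-8 locale.  Only the
   trailing bytes 0x80-0xbf can continue a character.  ASCII bytes and
   lead bytes start one, and 0xc0, 0xc1 and 0xf5-0xff are always
   encoding errors, so all of those count as boundaries.  */
static void
init_boundary_table_utf8 (void)
{
  for (int b = 0; b < NBYTES; b++)
    always_character_boundary[b] = ! (0x80 <= b && b <= 0xbf);
}
```

In a non-UTF-8 multibyte locale the table would have to be built more
conservatively, since a byte that looks like ASCII may be the second
byte of a character there.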
> This won't assign anything to *WCP, contrary to the documented API for
> skip_remains_mb. This is OK (as callers don't care) but the API
> documentation should be changed to reflect the actual behavior.
Oh! If WCP is needed, we must go through step by step, because the wide
character before P is set to *WCP. I fixed it and updated the API
documentation.
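
One possible shape of that behavior, as a self-contained sketch rather
than the code in the attached patch: the shortcut is taken only when the
caller passes a null WCP, and otherwise the buffer is decoded character
by character so that the character before P ends up in *WCP. The names
skip_remains_mb_sketch, decode_one and is_boundary_byte are
hypothetical, and decode_one is a simplified UTF-8 decoder standing in
for whatever dfa.c really uses (it does not reject overlong forms or
surrogates).

```c
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in decoder: decode one UTF-8 character at S, with
   N >= 1 bytes available.  Store it in *WC and return its length.  An
   invalid or truncated sequence yields WEOF and a length of 1.  */
static size_t
decode_one (wint_t *wc, unsigned char const *s, size_t n)
{
  unsigned char b = s[0];
  size_t len = (b < 0x80 ? 1
                : 0xc2 <= b && b <= 0xdf ? 2
                : 0xe0 <= b && b <= 0xef ? 3
                : 0xf0 <= b && b <= 0xf4 ? 4 : 0);
  if (len == 0 || n < len)
    {
      *wc = WEOF;
      return 1;
    }
  /* Keep the payload bits of the lead byte.  */
  wint_t c = len == 1 ? b : b & (0xff >> (len + 1));
  for (size_t i = 1; i < len; i++)
    {
      if ((s[i] & 0xc0) != 0x80)
        {
          *wc = WEOF;
          return 1;
        }
      c = c << 6 | (s[i] & 0x3f);
    }
  *wc = c;
  return len;
}

/* In UTF-8, every byte except a trailing byte is a boundary.  */
static bool
is_boundary_byte (unsigned char b)
{
  return ! (0x80 <= b && b <= 0xbf);
}

/* Return the first character boundary at or after P, where MBP is a
   known boundary at or before P and END is the end of the buffer.  If
   WCP is non-null, set *WCP to the character preceding the result
   (WEOF if there is none or it is an encoding error).  */
static unsigned char const *
skip_remains_mb_sketch (unsigned char const *p, unsigned char const *mbp,
                        unsigned char const *end, wint_t *wcp)
{
  /* The shortcut cannot report the preceding character, so it is
     used only when the caller does not ask for one.  */
  if (wcp == NULL && p < end && is_boundary_byte (*p))
    return p;

  wint_t wc = WEOF;
  while (mbp < p)
    mbp += decode_one (&wc, mbp, end - mbp);
  if (wcp != NULL)
    *wcp = wc;
  return mbp;
}
```

With a non-null WCP the function always pays for the full scan from
MBP, which is the trade-off described above.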
[0001-dfa-improvement-for-checking-of-multibyte-character-.patch (text/plain, attachment)]
This bug report was last modified 9 years and 77 days ago.