GNU bug report logs - #18777
[PATCH] dfa: improvement for checking of multibyte character boundary
Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Date: Mon, 20 Oct 2014 15:05:01 UTC
Severity: normal
Tags: patch
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
On 12/15/2014 06:59 AM, Norihiro Tanaka wrote:
> +/* True if each byte can not occur inside a multibyte character */
> +static bool always_single_byte[NOTCHAR];
> +
> +static void
> +dfaalwayssb (void)
> +{
> +  size_t i;
> +  unsigned char const uc[] = { '\0', '\n', '\r', '.', '/' };
> +  for (i = 0; i < sizeof uc / sizeof uc[0]; ++i)
> +    always_single_byte[uc[i]] = true;
> +}
Can't we improve this when using_utf8 () is true? In that case, every
ASCII character is always single byte. Also, the bytes 0xc0, 0xc1, and
0xf5 through 0xff can be added to the table: they are not single-byte
characters but they are always encoding errors so they will be a
character boundary as far as skip_remains_mb is concerned. This
suggests that the table 'always_single_byte' should be renamed to
something like 'always_character_boundary'.
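
To make the suggestion concrete, here is a rough sketch of how the table might be filled. It is not a tested patch against dfa.c: the name init_always_character_boundary, the 'utf8' parameter (standing in for the result of using_utf8 ()) and the NBYTES constant (standing in for NOTCHAR) are placeholders for illustration only. Bytes not mentioned above are conservatively left false.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Number of distinct byte values; a stand-in for dfa.c's NOTCHAR.  */
enum { NBYTES = UCHAR_MAX + 1 };

/* True if a byte with this value is always a character boundary:
   either it is always a single-byte character, or (in UTF-8) it is
   always an encoding error by itself.  */
bool always_character_boundary[NBYTES];

/* Hypothetical initializer.  UTF8 stands in for using_utf8 ().  */
void
init_always_character_boundary (bool utf8)
{
  size_t i;

  for (i = 0; i < NBYTES; i++)
    always_character_boundary[i] = false;

  if (utf8)
    {
      /* Every ASCII byte is a complete character in UTF-8.  */
      for (i = 0; i < 0x80; i++)
        always_character_boundary[i] = true;

      /* 0xc0, 0xc1 and 0xf5..0xff never occur in valid UTF-8, so
         each of them is always an encoding error by itself.  */
      always_character_boundary[0xc0] = true;
      always_character_boundary[0xc1] = true;
      for (i = 0xf5; i <= 0xff; i++)
        always_character_boundary[i] = true;
    }
  else
    {
      /* Other encodings: keep the short list from the patch.  */
      unsigned char const uc[] = { '\0', '\n', '\r', '.', '/' };
      for (i = 0; i < sizeof uc / sizeof uc[0]; i++)
        always_character_boundary[uc[i]] = true;
    }
}
```

With something like this, the early return in skip_remains_mb would test always_character_boundary[*p] instead of always_single_byte[*p].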
>    wint_t wc = WEOF;
> +  if (always_single_byte[*p])
> +    return p;
This won't assign anything to *WCP, contrary to the documented API
for skip_remains_mb. This is OK (as callers don't care) but the API
documentation should be changed to reflect the actual behavior.