GNU bug report logs - #18777
[PATCH] dfa: improvement for checking of multibyte character boundary

Package: grep;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Mon, 20 Oct 2014 15:05:01 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, Eric Blake <eblake <at> redhat.com>
Cc: 18777 <at> debbugs.gnu.org
Subject: bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary
Date: Mon, 15 Dec 2014 09:43:54 -0800

On 12/15/2014 06:59 AM, Norihiro Tanaka wrote:
> +/* True if each byte can not occur inside a multibyte character  */
> +static bool always_single_byte[NOTCHAR];
> +
> +static void
> +dfaalwayssb (void)
> +{
> +  size_t i;
> +  unsigned char const uc[] = { '\0', '\n', '\r', '.', '/' };
> +  for (i = 0; i < sizeof uc / sizeof uc[0]; ++i)
> +    always_single_byte[uc[i]] = true;
> +}

Can't we improve this when using_utf8 () is true?  In that case, every 
ASCII byte is always a single-byte character.  Also, the bytes 0xc0, 0xc1, 
and 0xf5 through 0xff can be added to the table: they are not single-byte 
characters, but they are always encoding errors, so each of them is a 
character boundary as far as skip_remains_mb is concerned.  This 
suggests that the table 'always_single_byte' should be renamed to 
something like 'always_character_boundary'.
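
A rough sketch of what that could look like, not a tested patch against 
dfa.c.  The initializer name init_always_character_boundary and its bool 
parameter (standing in for the result of using_utf8 ()) are made up for 
illustration, and NOTCHAR is assumed to be 256 so that the fragment compiles 
on its own:

```c
#include <stdbool.h>
#include <stddef.h>

enum { NOTCHAR = 256 };   /* Assumption: same value as NOTCHAR in dfa.c.  */

/* True if the byte always starts a new character (or is an encoding
   error by itself), so that it is always a character boundary.  */
static bool always_character_boundary[NOTCHAR];

/* Hypothetical initializer.  UTF8 stands in for using_utf8 ().  */
static void
init_always_character_boundary (bool utf8)
{
  unsigned char const uc[] = { '\0', '\n', '\r', '.', '/' };
  size_t i;

  for (i = 0; i < NOTCHAR; i++)
    always_character_boundary[i] = false;

  /* The bytes from the original patch, used for every locale.  */
  for (i = 0; i < sizeof uc / sizeof uc[0]; i++)
    always_character_boundary[uc[i]] = true;

  if (utf8)
    {
      /* Every ASCII byte is a complete character in UTF-8.  */
      for (i = 0; i < 0x80; i++)
        always_character_boundary[i] = true;

      /* These bytes never occur in valid UTF-8, so each one is an
         encoding error on its own.  */
      always_character_boundary[0xc0] = true;
      always_character_boundary[0xc1] = true;
      for (i = 0xf5; i <= 0xff; i++)
        always_character_boundary[i] = true;
    }
}
```

Continuation bytes (0x80 through 0xbf) and valid lead bytes (0xc2 through 
0xf4) stay false, since whether they sit on a boundary depends on context.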

>     wint_t wc = WEOF;
> +  if (always_single_byte[*p])
> +    return p;

This won't assign anything to *WCP, contrary to the documented API for 
skip_remains_mb.  This is OK (as callers don't care) but the API 
documentation should be changed to reflect the actual behavior.



