GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales

Previous Next

Package: grep;

Reported by: Vincent Lefevre <vincent <at> vinc17.net>

Date: Fri, 12 Sep 2014 01:26:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Full log


Message #149 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Fri, 19 Dec 2014 23:00:38 +0900
On Thu, 18 Dec 2014 14:45:58 +0100
Vincent Lefevre <vincent <at> vinc17.net> wrote:
> > 
> >   0xE0 0xC2 0xFF
> >   0xED 0xA0 0xFF
> >   0xF0 0xBF 0xFF 0xFF
> 
> If I'm not mistaken, these first three are also treated as invalid by
> my patch (and should be treated as invalid by any tool).

I got them from pcre_valid_utf8(), but I made some mistakes.  They are
as following.

  0xE0 0xAF 0xBF
  0xED 0xA0 0xBF
  0xF0 0x8F 0xBF 0xBF

By the way, they are correspond with following codes in pcre_valid_utf8().

    if (c == 0xe0 && (d & 0x20) == 0)
      {
      *erroroffset = (int)(p - string) - 2;
      return PCRE_UTF8_ERR16;
      }
    if (c == 0xed && d >= 0xa0)
      {
      *erroroffset = (int)(p - string) - 2;
      return PCRE_UTF8_ERR14;
      }

    ........

    if (c == 0xf0 && (d & 0x30) == 0)
      {
      *erroroffset = (int)(p - string) - 3;
      return PCRE_UTF8_ERR17;
      }
    if (c > 0xf4 || (c == 0xf4 && d > 0x8f))
      {
      *erroroffset = (int)(p - string) - 3;
      return PCRE_UTF8_ERR13;
      }

> BTW,
> 
>   printf "\xF4\xBF\xBF\xBF\n" | grep .
> 
> finds a match, and this appears to be a bug (grep should follow
> the current standard).

I also see it is a bug as you say.  mbrlen() in glibc returns (size_t) -1
for the sequence.





This bug report was last modified 3 years and 181 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.