GNU bug report logs - #18454: Improve performance when -P (PCRE) is used in UTF-8 locales
On 2014-12-19 23:00:38 +0900, Norihiro Tanaka wrote:
> I got them from pcre_valid_utf8(), but I made some mistakes. They are
> as follows.
>
> 0xE0 0xAF 0xBF
This one is valid UTF-8 and corresponds to the code point U+0BFF, and
the following matches:
$ printf "\xE0\xAF\xBF\n" | grep -P .
> 0xED 0xA0 0xBF
OK, this is in the surrogate area, and it doesn't match with PCRE.
> 0xF0 0x8F 0xBF 0xBF
This one decodes to U+FFFF, which fits in 3 bytes (0xEF 0xBF 0xBF), so
the 4-byte form is an overlong encoding and thus invalid UTF-8.
> > BTW,
> >
> > printf "\xF4\xBF\xBF\xBF\n" | grep .
> >
> > finds a match, and this appears to be a bug (grep should follow
> > the current standard).
>
> I also see it is a bug as you say. mbrlen() in glibc returns (size_t) -1
> for the sequence.
Ditto with:
printf "\xED\xA0\xBF\n" | grep .
(surrogate area).
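
For reference, the code points discussed above can be re-derived by
hand. The following is only a minimal Python sketch, not code taken
from grep, PCRE or glibc; the names decode_utf8_lax, MIN_FOR_LENGTH and
classify are made up for this illustration. It strips the UTF-8 framing
bits without any validity check, then applies the three checks that
matter here (overlong form, surrogate range, limit of U+10FFFF):

```python
def decode_utf8_lax(seq: bytes) -> int:
    """Decode one UTF-8-shaped sequence into a raw code point.

    Only the framing (lead byte and 10xxxxxx continuation bytes) is
    checked; overlong forms, surrogates and large values pass through.
    """
    lead = seq[0]
    if lead < 0x80:
        length, cp = 1, lead
    elif lead & 0xE0 == 0xC0:
        length, cp = 2, lead & 0x1F
    elif lead & 0xF0 == 0xE0:
        length, cp = 3, lead & 0x0F
    elif lead & 0xF8 == 0xF0:
        length, cp = 4, lead & 0x07
    else:
        raise ValueError("not a UTF-8 lead byte")
    if len(seq) != length:
        raise ValueError("sequence length does not match lead byte")
    for byte in seq[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("bad continuation byte")
        cp = (cp << 6) | (byte & 0x3F)
    return cp


# Smallest code point that legitimately needs each encoded length.
MIN_FOR_LENGTH = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000}


def classify(seq: bytes) -> str:
    """Say why a sequence is valid or invalid UTF-8."""
    cp = decode_utf8_lax(seq)
    if cp < MIN_FOR_LENGTH[len(seq)]:
        return f"overlong encoding of U+{cp:04X}"
    if 0xD800 <= cp <= 0xDFFF:
        return f"surrogate U+{cp:04X}"
    if cp > 0x10FFFF:
        return f"U+{cp:04X}, beyond U+10FFFF"
    return f"valid, U+{cp:04X}"


if __name__ == "__main__":
    for seq in (b"\xE0\xAF\xBF", b"\xED\xA0\xBF",
                b"\xF0\x8F\xBF\xBF", b"\xF4\xBF\xBF\xBF"):
        shown = " ".join(f"0x{b:02X}" for b in seq)
        print(shown, "->", classify(seq))
```

Under these assumptions, only the first sequence is classified as
valid; the other three fall into the surrogate, overlong and
out-of-range cases respectively.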
--
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
This bug report was last modified 3 years and 181 days ago.