#18454 - Improve performance when -P (PCRE) is used in UTF-8 locales

GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales

Package: grep

Reported by: Vincent Lefevre <vincent <at> vinc17.net>

Date: Fri, 12 Sep 2014 01:26:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Message #146 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Thu, 18 Dec 2014 14:45:58 +0100

Sorry for the late reply.

On 2014-11-29 11:58:48 +0900, Norihiro Tanaka wrote:
> On Fri, 28 Nov 2014 16:50:29 +0100
> Vincent Lefevre <vincent <at> vinc17.net> wrote:
> > What matters is whether a sequence corresponds to a valid UTF-8
> > encoded Unicode character. My patch ensures that pcre_exec is called
> > on a string with only such characters, which implies that this is
> > also valid UTF-8 for PCRE (whether Unicode validity is also considered
> > in valid_utf8() or not). So, there's no valid reason why grep would
> > crash under such a condition.
>
> It seems that PCRE treats e.g. the following characters as invalid. It
> means we should not pass these characters into pcre_exec with the
> PCRE_NO_UTF8_CHECK option.
>
> 0xE0 0xC2 0xFF
> 0xED 0xA0 0xFF
> 0xF0 0xBF 0xFF 0xFF

If I'm not mistaken, these first three are also treated as invalid
by my patch (and should be treated as invalid by any tool).

> 0xF4 0xBF 0xBF 0xBF (corresponding to U+0013ffff).

Well, I followed some comment in the grep source, which is currently
incorrect. pcreunicode(3) specifies that it follows RFC 3629, and
that only values in the range U+0 to U+10FFFF, excluding the
surrogate area, are allowed. I'll try to update my patch. But IMHO,
it would be better to get PCRE improved, and I had opened a bug:

  http://bugs.exim.org/show_bug.cgi?id=1554

BTW,

  printf "\xF4\xBF\xBF\xBF\n" | grep .

finds a match, and this appears to be a bug (grep should follow the
current standard).

-- 
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
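
For readers following the four byte sequences discussed in this message, here is a minimal sketch of an RFC 3629 validity check: code points limited to U+0 through U+10FFFF, with no surrogates and no overlong forms. It is an illustration only. The function names sequence_length and is_valid_utf8 are hypothetical, and this is neither grep's valid_utf8() nor the patch under discussion.

```python
# Sketch of RFC 3629 well-formedness checking. Hypothetical helper names;
# not grep's valid_utf8() and not the patch discussed in this thread.

def sequence_length(data: bytes, i: int) -> int:
    """Length of the well-formed UTF-8 sequence starting at data[i],
    or 0 if the bytes at that position are not valid per RFC 3629."""
    b0 = data[i]
    if b0 < 0x80:
        return 1  # ASCII
    # For each lead byte: number of trailing bytes, and the range allowed
    # for the FIRST trailing byte (the others must be 0x80..0xBF).
    if 0xC2 <= b0 <= 0xDF:
        trailing, lo, hi = 1, 0x80, 0xBF
    elif b0 == 0xE0:
        trailing, lo, hi = 2, 0xA0, 0xBF  # below 0xA0 would be overlong
    elif b0 == 0xED:
        trailing, lo, hi = 2, 0x80, 0x9F  # above 0x9F would be a surrogate
    elif 0xE1 <= b0 <= 0xEF:
        trailing, lo, hi = 2, 0x80, 0xBF
    elif b0 == 0xF0:
        trailing, lo, hi = 3, 0x90, 0xBF  # below 0x90 would be overlong
    elif 0xF1 <= b0 <= 0xF3:
        trailing, lo, hi = 3, 0x80, 0xBF
    elif b0 == 0xF4:
        trailing, lo, hi = 3, 0x80, 0x8F  # above 0x8F exceeds U+10FFFF
    else:
        return 0  # 0x80..0xC1 and 0xF5..0xFF never start a sequence
    if i + trailing >= len(data):
        return 0  # sequence is truncated
    if not lo <= data[i + 1] <= hi:
        return 0
    for k in range(2, trailing + 1):
        if not 0x80 <= data[i + k] <= 0xBF:
            return 0
    return trailing + 1


def is_valid_utf8(data: bytes) -> bool:
    """True if data consists only of well-formed UTF-8 sequences."""
    i = 0
    while i < len(data):
        n = sequence_length(data, i)
        if n == 0:
            return False
        i += n
    return True


# The four sequences quoted in the message above.
SAMPLES = [
    b"\xE0\xC2\xFF",
    b"\xED\xA0\xFF",
    b"\xF0\xBF\xFF\xFF",
    b"\xF4\xBF\xBF\xBF",
]

if __name__ == "__main__":
    for s in SAMPLES:
        print(s.hex(" "), "valid" if is_valid_utf8(s) else "invalid")
```

Under these rules all four sequences are rejected. The last one, 0xF4 0xBF 0xBF 0xBF, fails only because of the upper bound on the byte following 0xF4, which is the U+10FFFF limit that pcreunicode(3) is cited for above.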


