GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales


Package: grep

Reported by: Vincent Lefevre <vincent <at> vinc17.net>

Date: Fri, 12 Sep 2014 01:26:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.



Message #140 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Fri, 28 Nov 2014 16:50:29 +0100

On 2014-11-28 23:31:49 +0900, Norihiro Tanaka wrote:
> Thanks for the patch.  However, it seems that valid_utf() in PCRE also
> accepts 5- and 6-byte characters.

In any case, even if PCRE considers these sequences as valid UTF-8,
they shouldn't match because they are not part of Unicode (if they
can match, this would be a bug in libpcre). My patch considers that
these sequences do not match, which is consistent with the expected
behavior.
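
To make "part of Unicode" concrete, here is a minimal sketch of the kind
of check I mean. It is not the code from my patch and not PCRE's own
validation; the function name unicode_utf8_len is hypothetical. It
follows the well-formed byte ranges of the Unicode standard, so lead
bytes of 5- and 6-byte sequences (0xF8..0xFD) are rejected, as are
overlong forms, surrogates, and values above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Return the length (1 to 4) of the well-formed UTF-8 sequence at the
   start of s[0..n-1], or 0 if the bytes there do not encode a Unicode
   character.  Hypothetical helper, for illustration only.  */
size_t
unicode_utf8_len (const unsigned char *s, size_t n)
{
  assert (s != NULL || n == 0);
  if (n == 0)
    return 0;

  unsigned char c = s[0];
  if (c < 0x80)
    return 1;                           /* ASCII */

  size_t len;
  unsigned char lo = 0x80, hi = 0xBF;   /* allowed range of 2nd byte */
  if (c >= 0xC2 && c <= 0xDF)
    len = 2;
  else if (c >= 0xE0 && c <= 0xEF)
    {
      len = 3;
      if (c == 0xE0)
        lo = 0xA0;                      /* reject overlong forms */
      if (c == 0xED)
        hi = 0x9F;                      /* reject U+D800..U+DFFF */
    }
  else if (c >= 0xF0 && c <= 0xF4)
    {
      len = 4;
      if (c == 0xF0)
        lo = 0x90;                      /* reject overlong forms */
      if (c == 0xF4)
        hi = 0x8F;                      /* reject values > U+10FFFF */
    }
  else
    return 0;   /* 0x80..0xC1 and 0xF5..0xFF, which covers the lead
                   bytes of 5- and 6-byte sequences */

  if (n < len || s[1] < lo || s[1] > hi)
    return 0;
  for (size_t i = 2; i < len; i++)
    if (s[i] < 0x80 || s[i] > 0xBF)
      return 0;
  return len;
}
```

With such a check, a 5-byte sequence starting with 0xF8 is treated
exactly like any other invalid byte, whatever PCRE itself would accept.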

> IMHO, we should assume that grep doesn't know how valid_utf() checks an
> input text, although we know that PCRE checks whether an input text is
> valid UTF-8 or not, so that grep keeps working even if PCRE changes the
> behaviour of valid_utf().
> 
> If we do not check for invalid UTF-8 characters with valid_utf8() in
> advance, grep may dump core with PCRE_NO_UTF8_CHECK.
> See http://debbugs.gnu.org/cgi/bugreport.cgi?bug=16586
> 
> So we cannot avoid checking for invalid UTF-8 characters with valid_utf8().
> Furthermore, we must perform the check as PCRE expects, but grep does
> not know how PCRE checks for invalid UTF-8 characters, due to the above
> assumption.

What matters is whether a sequence corresponds to a valid UTF-8
encoded Unicode character. My patch ensures that pcre_exec is called
on a string with only such characters, which implies that this is
also valid UTF-8 for PCRE (whether Unicode validity is also considered
in valid_utf8() or not). So, there's no valid reason why grep would
crash under such a condition.
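
As a rough illustration of that idea (a sketch under my own
assumptions, not the actual patch), the buffer can be cut into maximal
runs of valid characters, and only those runs handed to the matcher.
The names char_len, for_each_valid_run, run_fn, run_stats and
record_run are all hypothetical; in grep the callback would be the
place where pcre_exec gets called, which is left out here.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the valid UTF-8 encoded Unicode character at s, or 0.  */
static size_t
char_len (const unsigned char *s, size_t n)
{
  static const unsigned long min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
  if (n == 0)
    return 0;
  unsigned char c = s[0];
  size_t len = (c < 0x80 ? 1 : (c & 0xE0) == 0xC0 ? 2
                : (c & 0xF0) == 0xE0 ? 3 : (c & 0xF8) == 0xF0 ? 4 : 0);
  if (len == 0 || n < len)
    return 0;
  unsigned long cp = len == 1 ? c : c & (0xFF >> (len + 1));
  for (size_t i = 1; i < len; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0;                       /* not a continuation byte */
      cp = cp << 6 | (s[i] & 0x3F);
    }
  /* Reject overlong forms, values above U+10FFFF and surrogates.  */
  if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0;
  return len;
}

typedef void (*run_fn) (const unsigned char *run, size_t len, void *ctx);

/* Call fn on each maximal run of valid characters in buf[0..n-1].
   Return the number of bytes skipped as invalid.  */
size_t
for_each_valid_run (const unsigned char *buf, size_t n, run_fn fn, void *ctx)
{
  assert (fn != NULL && (buf != NULL || n == 0));
  size_t skipped = 0, start = 0, i = 0;
  while (i < n)
    {
      size_t len = char_len (buf + i, n - i);
      if (len != 0)
        {
          i += len;
          continue;
        }
      if (i > start)
        fn (buf + start, i - start, ctx);
      skipped++;
      start = ++i;                      /* next run begins after the bad byte */
    }
  if (i > start)
    fn (buf + start, i - start, ctx);
  return skipped;
}

/* Example callback: just counts the runs and bytes it is given.  */
struct run_stats { size_t runs, bytes; };

void
record_run (const unsigned char *run, size_t len, void *ctx)
{
  struct run_stats *st = ctx;
  (void) run;
  st->runs++;
  st->bytes += len;
}
```

Every run passed to the callback contains only valid UTF-8 encoded
Unicode characters, so a matcher that skips its own validity check is
never given anything it could mis-parse.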

-- 
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)




This bug report was last modified 3 years and 181 days ago.
