#18454 - Improve performance when -P (PCRE) is used in UTF-8 locales

GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales

Package: grep

Reported by: Vincent Lefevre <vincent <at> vinc17.net>

Date: Fri, 12 Sep 2014 01:26:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Message #8 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vincent Lefevre <vincent <at> vinc17.net>, 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Thu, 11 Sep 2014 19:53:23 -0700

Vincent Lefevre wrote:
> Things could be done in grep:
>
> 1. Ignore -P when the pattern would have the same meaning without -P
>    (patterns could also be transformed, e.g. "a\d+b" -> "a[0-9]\+b",
>    at least for the simplest cases).
>
> 2. Call PCRE in the C locale when this is equivalent.

I had already considered these ideas along with several others, but they would require grep to parse and analyze the Perl regular expression. I don't know the PCRE syntax and it would take some time to write a parser. And even if I wrote one, the next PCRE release would likely change the syntax. It sounds very painful to maintain.

> 3. Transform invalid bytes to null bytes in-place before the PCRE
>    call. This changes the current semantic, but:
>    * the semantic on invalid bytes has never been specified, AFAIK;
>    * the best *practical* behavior may not be the current one

As we've already discussed, this would be incompatible with how invalid bytes are treated by other matchers. And would have undesirable practical effects, e.g., the pattern 'a..*b' would match data that would look like "ab" on many screens (because the null byte would vanish). It's a real kludge that will bite users.

Even if we went along with the kludge, grep does not know what bytes PCRE considers to be invalid without invoking PCRE, which is what it's doing now. (Yes, PCRE says it's parsing UTF-8, but there are different ways to do that and they don't all agree.) I suppose grep could reengineer libpcre's internals, to exactly duplicate the algorithm that libpcre uses to decide when bytes are invalid (except to do it 10X faster :-), but then that'd be another thing to maintain in parallel with libpcre.

All of these changes sound like a lot of work, which nobody is willing to do.

Here's a different idea. How about invoking grep with the --binary-files=without-match option? This should avoid much of the libpcre performance problem, without having to change 'grep'.
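
The objection to idea 3 can be illustrated with a minimal Python sketch. It is not code from grep or from this thread: the helper name nul_out_invalid_utf8 is hypothetical, Python's re module stands in for the matcher, and Python's UTF-8 decoder stands in for the validity check. That last point is itself an assumption, since, as noted in the message, libpcre's idea of an invalid byte may not agree with another decoder's.

```python
import re


def nul_out_invalid_utf8(data: bytes) -> bytes:
    """Replace every byte that Python's UTF-8 decoder rejects with NUL.

    Assumption: Python's decoder is only a stand-in for the validity
    check; libpcre may classify some byte sequences differently.
    """
    # surrogateescape maps each undecodable byte 0xNN to the lone
    # surrogate U+DCNN, so invalid bytes remain identifiable one by one.
    text = data.decode("utf-8", errors="surrogateescape")
    text = re.sub("[\udc80-\udcff]", "\x00", text)
    return text.encode("utf-8")


pattern = re.compile(rb"a..*b")

line = b"a\xffb"                           # 0xFF never occurs in valid UTF-8
patched = nul_out_invalid_utf8(line)       # b"a\x00b"
on_screen = patched.replace(b"\x00", b"")  # roughly what many terminals show

print(patched, bool(pattern.search(patched)), on_screen)
# prints: b'a\x00b' True b'ab'
```

The patched line matches 'a..*b' because the NUL byte counts as the character consumed by the first '.', yet once the NUL is dropped for display the matched line reads "ab", which on its own would not match the pattern.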


