GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales


Package: grep

Reported by: Vincent Lefevre <vincent <at> vinc17.net>

Date: Fri, 12 Sep 2014 01:26:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.



Message #8 received at 18454 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vincent Lefevre <vincent <at> vinc17.net>, 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Thu, 11 Sep 2014 19:53:23 -0700

Vincent Lefevre wrote:
> Things could be done in grep:
>
> 1. Ignore -P when the pattern would have the same meaning without -P
>     (patterns could also be transformed, e.g. "a\d+b" -> "a[0-9]\+b",
>     at least for the simplest cases).
>
> 2. Call PCRE in the C locale when this is equivalent.

I had already considered these ideas along with several others, but they 
would require grep to parse and analyze the Perl regular expression.  I 
don't know the PCRE syntax and it would take some time to write a 
parser.  And even if I wrote one, the next PCRE release would likely 
change the syntax.  It sounds very painful to maintain.
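
To make the maintenance concern concrete, here is a minimal sketch (in Python, not grep's C code) of what option 1 might look like for only the very simplest patterns. The function name and the whitelist are hypothetical, and the sketch assumes that \d means exactly [0-9], which depends on how PCRE is configured. Anything outside the whitelist is refused, which is where a real PCRE parser would have to take over.

```python
import re

# Hypothetical whitelist: ASCII letters, digits and spaces as literals,
# plus the escape \d optionally followed by a greedy '+'.
_TRIVIAL = re.compile(r"(?:[A-Za-z0-9 ]|\\d\+?)*")


def pcre_to_bre_if_trivial(pattern):
    """Return a GNU BRE equivalent of `pattern`, or None if unsure.

    Assumes \\d is equivalent to [0-9]; that is an assumption about the
    PCRE configuration, not something this sketch can verify.
    """
    if not _TRIVIAL.fullmatch(pattern):
        return None  # anything else would need a real PCRE parser
    # Rewrite "\d+" first so the "+" becomes the GNU BRE "\+".
    return pattern.replace(r"\d+", r"[0-9]\+").replace(r"\d", "[0-9]")


if __name__ == "__main__":
    for p in (r"a\d+b", r"a\d+?b", r"(a|b)\d"):
        print(repr(p), "->", repr(pcre_to_bre_if_trivial(p)))
```

Even this tiny whitelist has to reject lazy quantifiers, groups and alternation, so every PCRE feature beyond it would need its own analysis.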

> 3. Transform invalid bytes to null bytes in-place before the PCRE
>     call. This changes the current semantic, but:
>     * the semantic on invalid bytes has never been specified, AFAIK;
>     * the best *practical* behavior may not be the current one

As we've already discussed, this would be incompatible with how invalid 
bytes are treated by other matchers.  It would also have undesirable 
practical effects: for example, the pattern 'a..*b' would match data 
that looks like "ab" on many screens (because the null byte would not 
be displayed).  It's a real kludge that will bite users.
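
The 'a..*b' example can be reproduced with a small sketch. It uses Python's re module and Python's own UTF-8 decoder as stand-ins, so it illustrates the effect rather than PCRE's or grep's actual behavior; the helper name is hypothetical.

```python
import re


def nullify_invalid_utf8(data: bytes) -> bytes:
    """Replace each byte that Python's UTF-8 decoder rejects with NUL.

    Stand-in for option 3; libpcre may classify bytes differently.
    """
    # surrogateescape maps every undecodable byte to U+DC80..U+DCFF,
    # one code point per byte, so the length is preserved.
    text = data.decode("utf-8", errors="surrogateescape")
    cleaned = "".join("\0" if "\udc80" <= ch <= "\udcff" else ch
                      for ch in text)
    return cleaned.encode("utf-8")


if __name__ == "__main__":
    raw = b"a\xffb"                      # 0xFF is never valid in UTF-8
    buf = nullify_invalid_utf8(raw)      # the invalid byte becomes NUL
    print(re.search(rb"a..*b", buf) is not None)  # NUL satisfies the '.'
    print(buf.replace(b"\0", b""))       # what a NUL-hiding display shows
```

The pattern needs at least one character between "a" and "b", yet it matches a buffer that a terminal dropping NUL bytes would render as "ab".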

Even if we went along with the kludge, grep does not know what bytes 
PCRE considers to be invalid without invoking PCRE, which is what it's 
doing now.  (Yes, PCRE says it's parsing UTF-8, but there are different 
ways to do that and they don't all agree.)  I suppose grep could 
reverse-engineer libpcre's internals, to exactly duplicate the algorithm 
that libpcre uses to decide when bytes are invalid (except to do it 10X 
faster :-), but then that'd be another thing to maintain in parallel 
with libpcre.
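
As an illustration of how two UTF-8 checkers can disagree, the sketch below compares Python's strict decoder with its "surrogatepass" mode. Neither is claimed to match libpcre's rules; the function names are hypothetical and the point is only that "valid UTF-8" depends on who is checking.

```python
def strict_is_valid(data: bytes) -> bool:
    """Valid according to Python's default (strict) UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def lenient_is_valid(data: bytes) -> bool:
    """Like strict_is_valid, but also accepts encoded surrogates."""
    try:
        data.decode("utf-8", errors="surrogatepass")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    samples = {
        "encoded surrogate U+D800": b"\xed\xa0\x80",
        "stray 0xFF byte": b"\xff",
        "e with acute accent": "\u00e9".encode("utf-8"),
    }
    for name, data in samples.items():
        print(name, strict_is_valid(data), lenient_is_valid(data))
```

A pre-filter built on one checker would mark bytes as invalid that a matcher built on the other accepts, so the two would have to be kept in agreement by hand.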

All of these changes sound like a lot of work, which nobody is willing 
to do.

Here's a different idea.  How about invoking grep with the 
--binary-files=without-match option?  This should avoid much of the 
libpcre performance problem, without having to change 'grep'.
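
For scripted use, the suggested invocation might be wrapped as follows. This is a sketch only: the helper names are hypothetical, it assumes a grep built with -P support is on PATH, and it uses only the -P, -e and --binary-files=without-match options.

```python
import subprocess


def build_grep_command(pattern, paths):
    """Build the argv for the suggested invocation (hypothetical helper)."""
    # "-e" keeps a pattern that starts with "-" from being read as an
    # option; "--" does the same for file names.
    return ["grep", "-P", "--binary-files=without-match",
            "-e", pattern, "--", *paths]


def run_grep(pattern, paths):
    """Run grep and return the CompletedProcess; requires grep with -P."""
    return subprocess.run(build_grep_command(pattern, paths),
                          capture_output=True, text=True, errors="replace")


if __name__ == "__main__":
    print(build_grep_command(r"a\d+b", ["notes.txt"]))
```

With this option grep assumes that a file it classifies as binary does not match, so such a file contributes no matches to the output.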




This bug report was last modified 3 years and 181 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.