GNU bug report logs -
#18454
Improve performance when -P (PCRE) is used in UTF-8 locales
Message #8 received at 18454 <at> debbugs.gnu.org (full text, mbox):
Vincent Lefevre wrote:
> Things could be done in grep:
>
> 1. Ignore -P when the pattern would have the same meaning without -P
> (patterns could also be transformed, e.g. "a\d+b" -> "a[0-9]\+b",
> at least for the simplest cases).
>
> 2. Call PCRE in the C locale when this is equivalent.
I had already considered these ideas along with several others, but they
would require grep to parse and analyze the Perl regular expression. I
don't know the PCRE syntax and it would take some time to write a
parser. And even if I wrote one, the next PCRE release would likely
change the syntax. It sounds very painful to maintain.
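
To illustrate the "simplest cases" mentioned in idea 1, here is a minimal sketch in Python (not grep code). The helper name `pcre_to_bre` is hypothetical. It assumes that `\d` means exactly the ASCII digits, which may not hold for every PCRE configuration. It refuses anything beyond ASCII letters, digits, `\d`, and a trailing `+`, which shows how little can be rewritten without a real parser.

```python
import re

# Hypothetical whitelist: one literal ASCII letter/digit or \d,
# optionally followed by a single '+'. Everything else is rejected.
_TOKEN = re.compile(r"(\\d|[A-Za-z0-9])(\+?)")

def pcre_to_bre(pattern):
    """Return an equivalent GNU BRE, or None if the pattern is not trivially safe."""
    out = []
    pos = 0
    while pos < len(pattern):
        m = _TOKEN.match(pattern, pos)
        if m is None:
            return None  # unrecognized syntax: fall back to PCRE
        atom, plus = m.groups()
        # Assumption: \d is equivalent to [0-9] (no Unicode digit classes).
        out.append("[0-9]" if atom == r"\d" else atom)
        if plus:
            out.append(r"\+")  # GNU BRE spells "one or more" as \+
        pos = m.end()
    return "".join(out)

print(pcre_to_bre(r"a\d+b"))
print(pcre_to_bre(r"a(?=b)"))
```

Anything outside the whitelist (lookaheads, lazy or possessive quantifiers, groups) returns None, so the caller would still have to use PCRE for those patterns.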
> 3. Transform invalid bytes to null bytes in-place before the PCRE
> call. This changes the current semantic, but:
> * the semantic on invalid bytes has never been specified, AFAIK;
> * the best *practical* behavior may not be the current one
As we've already discussed, this would be incompatible with how invalid
bytes are treated by other matchers. It would also have undesirable
practical effects: for example, the pattern 'a..*b' would match data that
looks like "ab" on many screens (because the null byte would not be
displayed). It's a real kludge that will bite users.
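
The 'a..*b' effect can be reproduced with a small Python sketch. This is an illustration only: the helper `nul_out_invalid` is hypothetical, and it uses Python's notion of invalid UTF-8, which is not necessarily the same as libpcre's.

```python
import codecs
import re

def _nul_replace(err):
    # Replace each undecodable byte with one NUL so byte offsets are preserved.
    if not isinstance(err, UnicodeDecodeError):
        raise err
    return ("\x00" * (err.end - err.start), err.end)

codecs.register_error("nul_replace", _nul_replace)

def nul_out_invalid(data):
    """Return data with every byte Python rejects as UTF-8 turned into NUL."""
    return data.decode("utf-8", errors="nul_replace").encode("utf-8")

# 0xFF can never appear in valid UTF-8.
sanitized = nul_out_invalid(b"a\xffb")
match = re.search(rb"a..*b", sanitized)
print(sanitized, match is not None)
```

After the substitution the data is "a", a NUL, and "b", so the pattern matches a line that many terminals would display as plain "ab".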
Even if we went along with the kludge, grep does not know what bytes
PCRE considers to be invalid without invoking PCRE, which is what it's
doing now. (Yes, PCRE says it's parsing UTF-8, but there are different
ways to do that and they don't all agree.) I suppose grep could
reverse-engineer libpcre's internals, to exactly duplicate the algorithm
that libpcre uses to decide when bytes are invalid (except to do it 10X
faster :-), but then that'd be another thing to maintain in parallel
with libpcre.
All of these changes sound like a lot of work, which nobody is willing
to do.
Here's a different idea. How about invoking grep with the
--binary-files=without-match option? This should avoid much of the
libpcre performance problem, without having to change 'grep'.
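
As a rough model of what that option does, the Python sketch below treats a binary file as non-matching before the regular-expression engine ever sees it. The function names are hypothetical, and the binary test (a NUL byte or bytes that do not decode as UTF-8) is an assumption made for illustration; grep's actual heuristic may differ.

```python
import re

def looks_binary(data):
    # Assumed heuristic: NUL bytes or undecodable UTF-8 mean "binary".
    if b"\x00" in data:
        return True
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False

def matching_lines(data, regex):
    """Return matching lines, treating binary data as having no matches."""
    if looks_binary(data):
        return []  # the matcher is never invoked on this data
    return [line for line in data.splitlines() if regex.search(line)]

pattern = re.compile(rb"a\d+b")
print(matching_lines(b"a1b\nxyz\n", pattern))
print(matching_lines(b"a1b\n\xff\n", pattern))
```

In this model, data with encoding errors is skipped whole, so the expensive handling of invalid bytes never occurs. The price is that matches in such files are no longer reported.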
This bug report was last modified 3 years and 181 days ago.