GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales
Message #65 received at 18454 <at> debbugs.gnu.org:
Zoltán, thanks for your comments on this subject. Some thoughts and
suggestions:
> - what should you do if you encounter an invalid UTF-8 opcode
Do whatever plain 'grep' does, which is what the glibc regular
expression matcher does. If I recall correctly, an encoding error in
the pattern matches the same encoding error in the string. It shouldn't
be that complicated.
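Roughly, the idea is something like the sketch below.  This is only an
illustration of the behavior, not the actual glibc or grep code, and
the names (next_unit, units_match) are made up for the example:

  #include <string.h>
  #include <wchar.h>

  /* Decode one unit from P (N bytes available).  A byte sequence that
     mbrtowc rejects is handed back as a single-byte "error character"
     that can match only an identical byte.  */
  struct unit { wchar_t wc; unsigned char errbyte; int is_err; size_t len; };

  static struct unit
  next_unit (char const *p, size_t n, mbstate_t *st)
  {
    struct unit u = { 0, 0, 0, 1 };
    size_t r = mbrtowc (&u.wc, p, n, st);
    if (r == (size_t) -1 || r == (size_t) -2)
      {
        memset (st, 0, sizeof *st);  /* resynchronize after the error */
        u.is_err = 1;
        u.errbyte = (unsigned char) *p;
      }
    else
      u.len = r ? r : 1;             /* r == 0 means an embedded NUL */
    return u;
  }

  /* Two units match if both decode to the same wide character, or if
     both are encoding errors consisting of the same byte.  */
  static int
  units_match (struct unit a, struct unit b)
  {
    return a.is_err == b.is_err
           && (a.is_err ? a.errbyte == b.errbyte : a.wc == b.wc);
  }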
> Everybody has different opinion about handling invalid UTF opcodes
I doubt whether users would care all that much, so long as the default
is reasonable. We don't get complaints about it with 'grep', anyway.
But if it's a real problem in the PCRE world, you could provide
compile-time or run-time options to satisfy the different opinions.
> everybody would suffer this performance regression, including those, who pass valid UTF strings.
I don't see why. libpcre can continue with its current implementation,
for users who pass valid UTF-8 strings and use PCRE_NO_UTF8_CHECK;
that's not a problem. The problem is the case where users pass
possibly-invalid strings and do not use PCRE_NO_UTF8_CHECK. libpcre has
a slow implementation for this case, and this slow implementation's
performance should be improvable without affecting the performance for
the PCRE_NO_UTF8_CHECK case.
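From the caller's side, the distinction between the two cases looks
roughly like the sketch below.  This is an illustration under my
assumptions, not grep's actual code; buf_is_valid_mb and match_subject
are names made up for the example:

  #include <pcre.h>
  #include <string.h>
  #include <wchar.h>

  /* Return 1 if the first LEN bytes of BUF are valid multibyte text in
     the current locale (assumed here to be a UTF-8 locale).  */
  static int
  buf_is_valid_mb (char const *buf, size_t len)
  {
    mbstate_t st;
    memset (&st, 0, sizeof st);
    while (len)
      {
        size_t n = mbrtowc (NULL, buf, len, &st);
        if (n == (size_t) -1 || n == (size_t) -2)
          return 0;
        if (n == 0)
          n = 1;                     /* embedded NUL counts as one byte */
        buf += n;
        len -= n;
      }
    return 1;
  }

  /* If the caller has already verified the subject, PCRE_NO_UTF8_CHECK
     keeps libpcre on its fast path.  Otherwise libpcre does its own
     (currently slow) validation; that is the path that could be sped up
     without touching the fast one.  */
  static int
  match_subject (pcre const *re, char const *subj, int len,
                 int *ovec, int ovecsize)
  {
    int options = buf_is_valid_mb (subj, len) ? PCRE_NO_UTF8_CHECK : 0;
    return pcre_exec (re, NULL, subj, len, 0, options, ovec, ovecsize);
  }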
> * The best solution is multi-threaded grepping
That would chew up CPU resources unnecessarily, by requiring two passes
over the input, one for checking UTF-8, the other for doing the actual
match. Granted, it might be faster in real time than what we have now,
but overall it would probably be more expensive (e.g., more energy
consumption), which doesn't sound promising.
> * The other solution is improving PCRE survivability: if the buffer passed to PCRE has at least one zero character code before the invalid input buffer, and maximum UTF character length - 1 (6 in UTF8, 1 in UTF 16) zeroes after the buffer, we could guarantee that PCRE does not crash and PCRE does not enter infinite loops. Nothing else is guaranteed
That doesn't sound like a win, I'm afraid. The use case that prompted
this bug report is someone using 'grep -r' to search for strings like
'foobar' in binary data, and this use case would not work with this
suggested solution.
I'm hoping that the recent set of changes to 'grep' lessens the urgency
of improving libpcre. On my platform (Fedora 20 x86-64) Jim Meyering's
benchmark <http://bugs.gnu.org/18454#56> says that with grep 2.18, grep
-P is 6.4x slower than plain grep, and that with the latest experimental
grep (including the patches I just posted in
<http://bugs.gnu.org/18454#62>), grep -P is 5.6x slower than plain grep.
So it's plausible that the latest set of fixes is good enough, in the
sense that, sure, PCRE is slower, but it has always been slower, and if
that used to be good enough then it should still be good enough.