GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales


Package: grep

Reported by: Vincent Lefevre <vincent <at> vinc17.net>

Date: Fri, 12 Sep 2014 01:26:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.



Message #65 received at 18454 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Zoltán Herczeg <hzmester <at> freemail.hu>, 
 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Thu, 25 Sep 2014 18:19:20 -0700
Zoltán, thanks for your comments on this subject.  Some thoughts and 
suggestions:

> - what should you do if you encounter an invalid UTF-8 opcode

Do whatever plain 'grep' does, which is what the glibc regular 
expression matcher does.  If I recall correctly, an encoding error in 
the pattern matches the same encoding error in the string.  It shouldn't 
be that complicated.
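
For concreteness, the idea can be sketched as a decoder step that, on
an encoding error, falls back to treating the bad byte as an opaque
one-byte unit that matches only an identical byte.  This is an
illustration of the described semantics only, not the actual glibc or
grep code:

    #include <string.h>
    #include <wchar.h>

    /* Decode the next character of S (N bytes remain) into *WC.
       Returns the number of bytes consumed.  On an encoding error,
       consume exactly one byte and report WEOF; the caller then
       compares the raw byte in the pattern against the raw byte in
       the subject, so an encoding error matches only the identical
       encoding error.  */
    static size_t
    next_unit (const char *s, size_t n, wint_t *wc, mbstate_t *st)
    {
      wchar_t w;
      size_t len = mbrtowc (&w, s, n, st);
      if (len == (size_t) -1 || len == (size_t) -2)
        {
          memset (st, 0, sizeof *st);   /* recover: reset shift state */
          *wc = WEOF;                    /* flag "encoding-error byte" */
          return 1;                      /* advance past the bad byte */
        }
      *wc = (wint_t) w;
      return len ? len : 1;              /* mbrtowc returns 0 for L'\0' */
    }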

> Everybody has a different opinion about handling invalid UTF opcodes

I doubt whether users would care all that much, so long as the default 
is reasonable.  We don't get complaints about it with 'grep', anyway. 
But if it's a real problem in the PCRE world, you could provide 
compile-time or run-time options to satisfy the different opinions.

> everybody would suffer this performance regression, including those who pass valid UTF strings.

I don't see why.  libpcre can continue with its current implementation, 
for users who pass valid UTF-8 strings and use PCRE_NO_UTF8_CHECK; 
that's not a problem.  The problem is the case where users pass 
possibly-invalid strings and do not use PCRE_NO_UTF8_CHECK.  libpcre has 
a slow implementation for this case, and this slow implementation's 
performance should be improvable without affecting the performance for 
the PCRE_NO_UTF8_CHECK case.
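
For reference, here is a minimal sketch of the two paths in question,
using the classic PCRE 1 API; the pattern and subject are made up for
illustration:

    #include <pcre.h>
    #include <stdio.h>
    #include <string.h>

    int
    main (void)
    {
      const char *err;
      int erroff;
      pcre *re = pcre_compile ("foobar", PCRE_UTF8, &err, &erroff, NULL);
      if (!re)
        {
          fprintf (stderr, "compile failed at %d: %s\n", erroff, err);
          return 1;
        }

      const char *subject = "some foobar text";
      int ovec[30];

      /* Slow path: libpcre validates the whole subject as UTF-8 on
         every call before matching.  */
      int rc = pcre_exec (re, NULL, subject, strlen (subject),
                          0, 0, ovec, 30);

      /* Fast path: the caller guarantees the subject is valid UTF-8,
         so the per-call validation is skipped.  */
      int rc2 = pcre_exec (re, NULL, subject, strlen (subject),
                           0, PCRE_NO_UTF8_CHECK, ovec, 30);

      printf ("%d %d\n", rc, rc2);
      pcre_free (re);
      return 0;
    }

It is only that first path, the per-call validation, whose performance
is at issue here.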

> * The best solution is multi-threaded grepping

That would chew up CPU resources unnecessarily, by requiring two passes 
over the input: one to check UTF-8 validity, the other to do the actual 
match.  Granted, it might be faster in real time than what we have now, 
but overall it'd probably be more expensive (e.g., more energy 
consumption), and that doesn't sound promising.
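
To make the cost of that extra pass concrete, this is roughly the kind
of whole-buffer scan that would have to precede (or run alongside) the
actual match.  A straightforward sketch, not libpcre's actual
validator:

    #include <stdbool.h>
    #include <stddef.h>

    /* One extra pass over BUF: return true if BUF[0..N) is valid
       UTF-8 (well-formed sequences, no overlongs, no surrogates,
       nothing above U+10FFFF).  */
    static bool
    utf8_valid (const unsigned char *buf, size_t n)
    {
      size_t i = 0;
      while (i < n)
        {
          unsigned char c = buf[i];
          size_t len;
          unsigned int min;
          if (c < 0x80) { i++; continue; }
          else if ((c & 0xE0) == 0xC0) { len = 2; min = 0x80; }
          else if ((c & 0xF0) == 0xE0) { len = 3; min = 0x800; }
          else if ((c & 0xF8) == 0xF0) { len = 4; min = 0x10000; }
          else return false;            /* stray continuation or 0xF8+ */
          if (n - i < len)
            return false;               /* truncated sequence */
          unsigned int cp = c & (0x7F >> len);
          for (size_t j = 1; j < len; j++)
            {
              if ((buf[i + j] & 0xC0) != 0x80)
                return false;           /* bad continuation byte */
              cp = (cp << 6) | (buf[i + j] & 0x3F);
            }
          if (cp < min || cp > 0x10FFFF
              || (0xD800 <= cp && cp <= 0xDFFF))
            return false;               /* overlong, too big, surrogate */
          i += len;
        }
      return true;
    }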

> * The other solution is improving PCRE survivability: if the buffer passed to PCRE has at least one zero character code before the invalid input buffer, and maximum UTF character length - 1 (6 in UTF8, 1 in UTF 16) zeroes after the buffer, we could guarantee that PCRE does not crash and PCRE does not enter infinite loops. Nothing else is guaranteed

That doesn't sound like a win, I'm afraid.  The use case that prompted 
this bug report is someone using 'grep -r' to search for strings like 
'foobar' in binary data, and this use case would not work with this 
suggested solution.
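
For clarity, the quoted padding scheme amounts to something like the
following; the byte counts are taken from the quote, and this is
illustrative only, since as noted the match results would still be
undefined for binary data:

    #include <stdlib.h>
    #include <string.h>

    /* Copy DATA into a fresh buffer with one zero byte before it and
       six zero bytes after it (the UTF-8 count the quote gives), so a
       matcher that walks off an invalid sequence hits a terminator
       instead of unmapped memory.  */
    static char *
    pad_for_pcre (const char *data, size_t n)
    {
      char *buf = malloc (1 + n + 6);
      if (!buf)
        return NULL;
      buf[0] = '\0';
      memcpy (buf + 1, data, n);
      memset (buf + 1 + n, 0, 6);
      return buf;                 /* match against buf + 1, length n */
    }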


I'm hoping that the recent set of changes to 'grep' lessens the urgency 
of improving libpcre.  On my platform (Fedora 20 x86-64), Jim Meyering's 
benchmark <http://bugs.gnu.org/18454#56> says that with grep 2.18, grep 
-P is 6.4x slower than plain grep, and that with the latest experimental 
grep (including the patches I just posted in 
<http://bugs.gnu.org/18454#62>), grep -P is 5.6x slower than plain grep. 
So it's plausible that the latest set of fixes is good enough, in the 
sense that PCRE is still slower, but it has always been slower, and if 
that used to be good enough then it should still be good enough.



