GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales


Package: grep

Reported by: Vincent Lefevre <vincent <at> vinc17.net>

Date: Fri, 12 Sep 2014 01:26:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.



Message #65 received at 18454 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Zoltán Herczeg <hzmester <at> freemail.hu>, 
 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Thu, 25 Sep 2014 18:19:20 -0700
Zoltán, thanks for your comments on this subject.  Some thoughts and 
suggestions:

> - what should you do if you encounter an invalid UTF-8 opcode

Do whatever plain 'grep' does, which is what the glibc regular 
expression matcher does.  If I recall correctly, an encoding error in 
the pattern matches the same encoding error in the string.  It shouldn't 
be that complicated.
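
For concreteness, the idea can be sketched as a decoder step that, on
an encoding error, falls back to treating the bad byte as an opaque
one-byte unit that matches only an identical byte.  This is an
illustration of the described semantics only, not the actual glibc or
grep code:

    #include <string.h>
    #include <wchar.h>

    /* Decode the next character of S (N bytes remain) into *WC.
       Returns the number of bytes consumed.  On an encoding error,
       consume exactly one byte and report WEOF; the caller then
       compares the raw byte in the pattern against the raw byte in
       the subject, so an encoding error matches only the identical
       encoding error.  */
    static size_t
    next_unit (const char *s, size_t n, wint_t *wc, mbstate_t *st)
    {
      wchar_t w;
      size_t len = mbrtowc (&w, s, n, st);
      if (len == (size_t) -1 || len == (size_t) -2)
        {
          memset (st, 0, sizeof *st);   /* recover: reset shift state */
          *wc = WEOF;                    /* flag "encoding-error byte" */
          return 1;                      /* advance past the bad byte */
        }
      *wc = (wint_t) w;
      return len ? len : 1;              /* mbrtowc returns 0 for L'\0' */
    }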

> Everybody has a different opinion about handling invalid UTF opcodes

I doubt whether users would care all that much, so long as the default 
is reasonable.  We don't get complaints about it with 'grep', anyway. 
But if it's a real problem in the PCRE world, you could provide 
compile-time or run-time options to satisfy the different opinions.

> everybody would suffer this performance regression, including those who pass valid UTF strings.

I don't see why.  libpcre can continue with its current implementation, 
for users who pass valid UTF-8 strings and use PCRE_NO_UTF8_CHECK; 
that's not a problem.  The problem is the case where users pass 
possibly-invalid strings and do not use PCRE_NO_UTF8_CHECK.  libpcre has 
a slow implementation for this case, and this slow implementation's 
performance should be improvable without affecting the performance for 
the PCRE_NO_UTF8_CHECK case.
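
For reference, here is a minimal sketch of the two paths in question,
using the classic PCRE 1 API; the pattern and subject are made up for
illustration:

    #include <pcre.h>
    #include <stdio.h>
    #include <string.h>

    int
    main (void)
    {
      const char *err;
      int erroff;
      pcre *re = pcre_compile ("foobar", PCRE_UTF8, &err, &erroff, NULL);
      if (!re)
        {
          fprintf (stderr, "compile failed at %d: %s\n", erroff, err);
          return 1;
        }

      const char *subject = "some foobar text";
      int ovec[30];

      /* Slow path: libpcre validates the whole subject as UTF-8 on
         every call before matching.  */
      int rc = pcre_exec (re, NULL, subject, strlen (subject),
                          0, 0, ovec, 30);

      /* Fast path: the caller guarantees the subject is valid UTF-8,
         so the per-call validation is skipped.  */
      int rc2 = pcre_exec (re, NULL, subject, strlen (subject),
                           0, PCRE_NO_UTF8_CHECK, ovec, 30);

      printf ("%d %d\n", rc, rc2);
      pcre_free (re);
      return 0;
    }

It is only that first path, the per-call validation, whose performance
is at issue here.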

> * The best solution is multi-threaded grepping

That would chew up CPU resources unnecessarily, by requiring two passes 
over the input: one to check UTF-8 validity, the other to do the actual 
match.  Granted, it might be faster in real time than what we have now, 
but overall it'd probably be more expensive (e.g., more energy 
consumption), and that doesn't sound promising.
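
To make the cost of that extra pass concrete, this is roughly the kind
of whole-buffer scan that would have to precede (or run alongside) the
actual match.  A straightforward sketch, not libpcre's actual
validator:

    #include <stdbool.h>
    #include <stddef.h>

    /* One extra pass over BUF: return true if BUF[0..N) is valid
       UTF-8 (well-formed sequences, no overlongs, no surrogates,
       nothing above U+10FFFF).  */
    static bool
    utf8_valid (const unsigned char *buf, size_t n)
    {
      size_t i = 0;
      while (i < n)
        {
          unsigned char c = buf[i];
          size_t len;
          unsigned int min;
          if (c < 0x80) { i++; continue; }
          else if ((c & 0xE0) == 0xC0) { len = 2; min = 0x80; }
          else if ((c & 0xF0) == 0xE0) { len = 3; min = 0x800; }
          else if ((c & 0xF8) == 0xF0) { len = 4; min = 0x10000; }
          else return false;            /* stray continuation or 0xF8+ */
          if (n - i < len)
            return false;               /* truncated sequence */
          unsigned int cp = c & (0x7F >> len);
          for (size_t j = 1; j < len; j++)
            {
              if ((buf[i + j] & 0xC0) != 0x80)
                return false;           /* bad continuation byte */
              cp = (cp << 6) | (buf[i + j] & 0x3F);
            }
          if (cp < min || cp > 0x10FFFF
              || (0xD800 <= cp && cp <= 0xDFFF))
            return false;               /* overlong, too big, surrogate */
          i += len;
        }
      return true;
    }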

> * The other solution is improving PCRE survivability: if the buffer passed to PCRE has at least one zero character code before the invalid input buffer, and maximum UTF character length - 1 (6 in UTF8, 1 in UTF 16) zeroes after the buffer, we could guarantee that PCRE does not crash and PCRE does not enter infinite loops. Nothing else is guaranteed

That doesn't sound like a win, I'm afraid.  The use case that prompted 
this bug report is someone using 'grep -r' to search for strings like 
'foobar' in binary data, and this use case would not work with this 
suggested solution.
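
For clarity, the quoted padding scheme amounts to something like the
following; the byte counts are taken from the quote, and this is
illustrative only, since as noted the match results would still be
undefined for binary data:

    #include <stdlib.h>
    #include <string.h>

    /* Copy DATA into a fresh buffer with one zero byte before it and
       six zero bytes after it (the UTF-8 count the quote gives), so a
       matcher that walks off an invalid sequence hits a terminator
       instead of unmapped memory.  */
    static char *
    pad_for_pcre (const char *data, size_t n)
    {
      char *buf = malloc (1 + n + 6);
      if (!buf)
        return NULL;
      buf[0] = '\0';
      memcpy (buf + 1, data, n);
      memset (buf + 1 + n, 0, 6);
      return buf;                 /* match against buf + 1, length n */
    }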


I'm hoping that the recent set of changes to 'grep' lessens the urgency 
of improving libpcre.  On my platform (Fedora 20 x86-64), Jim Meyering's 
benchmark <http://bugs.gnu.org/18454#56> says that with grep 2.18, grep 
-P is 6.4x slower than plain grep, and that with the latest experimental 
grep (including the patches I just posted in 
<http://bugs.gnu.org/18454#62>), grep -P is 5.6x slower than plain grep. 
So it's plausible that the latest set of fixes is good enough, in the 
sense that PCRE is still slower, but it has always been slower, and if 
that used to be good enough then it should still be good enough.



