#18454 - Improve performance when -P (PCRE) is used in UTF-8 locales

GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales

Package: grep;

Reported by: Vincent Lefevre <vincent <at> vinc17.net>

Date: Fri, 12 Sep 2014 01:26:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

View this message in rfc822 format

From: Vincent Lefevre <vincent <at> vinc17.net> To: Paul Eggert <eggert <at> cs.ucla.edu> Cc: 18454 <at> debbugs.gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales Date: Fri, 12 Sep 2014 10:20:08 +0200

On 2014-09-11 19:53:23 -0700, Paul Eggert wrote: > Vincent Lefevre wrote: > >Things could be done in grep: > > > >1. Ignore -P when the pattern would have the same meaning without -P > > (patterns could also be transformed, e.g. "a\d+b" -> "a[0-9]\+b", > > at least for the simplest cases). > > > >2. Call PCRE in the C locale when this is equivalent. > > I had already considered these ideas along with several others, but they > would require grep to parse and analyze the Perl regular expression. I > don't know the PCRE syntax and it would take some time to write a parser. > And even if I wrote one, the next PCRE release would likely change the > syntax. It sounds very painful to maintain. I think that (1) is rather simple, even though optimization could be missed on some patterns: ERE and PCRE have a large equivalent subclass. The pattern could be examined left to right and would consist of: - Normal characters. - ".", "^" at the beginning, "$" (alone) at the end. - [] with normal characters inside. - "*", "+", "?", "{...}" form not followed by one of "*+?{". - "|" and "(" not followed by one of these 4 characters. - "\" followed by one of ".^$[*+?{". - Some "\" + letter sequences could be recognised as well. Something like that (I haven't checked carefully). There could be another option to allow such an optimization or not. > >3. Transform invalid bytes to null bytes in-place before the PCRE > > call. This changes the current semantic, but: > > * the semantic on invalid bytes has never been specified, AFAIK; > > * the best *practical* behavior may not be the current one > > As we've already discussed, this would be incompatible with how invalid > bytes are treated by other matchers. The same thing could be done with other matchers (in an optional way). > And would have undesirable practical effects, e.g., the pattern > 'a..*b' would match data that would look like "ab" on many screens > (because the null byte would vanish). It's a real kludge that will > bite users. But this is already the case: $ printf "a\0b\n" ab $ printf "a\0b\n" | grep 'a..*b' Binary file (standard input) matches The transformation won't touch null bytes. It would just interpret invalid bytes as null bytes, so that they get matched by ".". > Even if we went along with the kludge, grep does not know what bytes > PCRE considers to be invalid without invoking PCRE, which is what > it's doing now. (Yes, PCRE says it's parsing UTF-8, but there are > different ways to do that and they don't all agree.) It would be a bug in PCRE. Parsing UTF-8 is standard. This is sumarized by: 0x00000000 - 0x0000007F: 0xxxxxxx 0x00000080 - 0x000007FF: 110xxxxx 10xxxxxx 0x00000800 - 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx 0x00010000 - 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx 0x00200000 - 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 0x04000000 - 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx (from the Linux utf-8(7) man page), everything else being invalid. Note that the pcre_exec(3) man page even say: PCRE_NO_UTF8_CHECK Do not check the subject for UTF-8 validity assuming that the check can be done on the user's side, i.e. in a standard way. > Here's a different idea. How about invoking grep with the > --binary-files=without-match option? This should avoid much of the > libpcre performance problem, without having to change 'grep'. I often want to take binary files into account, for instance because executables can contain text I search for (error messages...). There may be other examples. -- Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/> 100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/> Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

This bug report was last modified 3 years and 231 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #18454 Improve performance when -P (PCRE) is used in UTF-8 locales

GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales