GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales


Package: grep;

Reported by: Vincent Lefevre <vincent <at> vinc17.net>

Date: Fri, 12 Sep 2014 01:26:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.




From: Zoltán Herczeg <hzmester <at> freemail.hu>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 18454 <at> debbugs.gnu.org
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Fri, 26 Sep 2014 08:36:40 +0200 (CEST)
Hi Paul,

thank you for the feedback.

>I doubt whether users would care all that much, so long as the default 
>is reasonable.  We don't get complaints about it with 'grep', anyway. 
>But if it's a real problem in the PCRE world, you could provide 
>compile-time or run-time options to satisfy the different opinions.

The situation is worse :( "Reasonable" has a different meaning for everybody.

Just consider these two examples, where \x9c is an incorrectly encoded Unicode codepoint:

/(?<=\x9c)#/

Does it match \xd5\x9c# when matching starts at the #? Noticing encoding errors during a backward scan is complicated.

/[\x9c-\x{ffff}]/

What exactly does this range define? Which valid and invalid UTF-8 byte sequences fall inside (and outside) the bounds?

Caseless matching is yet another question: does /\xe9/ match \xc3\x89, or the invalid UTF-8 byte sequence \xc9? More generally, Unicode defines several character properties; what properties does an invalid codepoint have?

Believe me, depending on their needs, everybody has a different answer to these questions. We don't want to force one group's view onto the others, since PCRE is a library. You have much more freedom to define any behavior you like, since grep is an end-user program.

>I don't see why.  libpcre can continue with its current implementation, 
>for users who pass valid UTF-8 strings and use PCRE_NO_UTF8_CHECK; 
>that's not a problem.  The problem is the case where users pass 
>possibly-invalid strings and do not use PCRE_NO_UTF8_CHECK.  libpcre has 
>a slow implementation for this case, and this slow implementation's 
>performance should be improvable without affecting the performance for 
>the PCRE_NO_UTF8_CHECK case.
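
Just to spell out those two cases at the API level (a sketch of my own, not code from grep or PCRE):

#include <stdio.h>
#include <pcre.h>

int main(void)
{
  const char *err; int erroff; int ovec[30];
  const char subject[] = "ab\x9c#";   /* \x9c makes this invalid UTF-8 */
  pcre *re = pcre_compile("#", PCRE_UTF8, &err, &erroff, NULL);
  if (re == NULL) return 1;

  /* Default: the whole subject is validated first; an invalid byte
     sequence is reported (typically as PCRE_ERROR_BADUTF8). */
  int rc = pcre_exec(re, NULL, subject, sizeof subject - 1, 0, 0, ovec, 30);
  printf("checked:   %d\n", rc);

  /* PCRE_NO_UTF8_CHECK: the caller promises the subject is valid UTF-8,
     the check is skipped, and the result is undefined if the promise is
     broken -- this is the fast path being discussed. */
  rc = pcre_exec(re, NULL, subject, sizeof subject - 1, 0,
                 PCRE_NO_UTF8_CHECK, ovec, 30);
  printf("unchecked: %d\n", rc);

  pcre_free(re);
  return 0;
}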

Regarding performance, this comes from the interpreter:

#define GETUTF8(c, eptr) \
    { \
    if ((c & 0x20) == 0)          /* 110xxxxx: two-byte sequence */ \
      c = ((c & 0x1f) << 6) | (eptr[1] & 0x3f); \
    else if ((c & 0x10) == 0)     /* 1110xxxx: three-byte sequence */ \
      c = ((c & 0x0f) << 12) | ((eptr[1] & 0x3f) << 6) | (eptr[2] & 0x3f); \
    else if ((c & 0x08) == 0)     /* 11110xxx: four-byte sequence */ \
      c = ((c & 0x07) << 18) | ((eptr[1] & 0x3f) << 12) | \
          ((eptr[2] & 0x3f) << 6) | (eptr[3] & 0x3f); \
    else if ((c & 0x04) == 0)     /* 111110xx: five-byte sequence */ \
      c = ((c & 0x03) << 24) | ((eptr[1] & 0x3f) << 18) | \
          ((eptr[2] & 0x3f) << 12) | ((eptr[3] & 0x3f) << 6) | \
          (eptr[4] & 0x3f); \
    else                          /* 1111110x: six-byte sequence */ \
      c = ((c & 0x01) << 30) | ((eptr[1] & 0x3f) << 24) | \
          ((eptr[2] & 0x3f) << 18) | ((eptr[3] & 0x3f) << 12) | \
          ((eptr[4] & 0x3f) << 6) | (eptr[5] & 0x3f); \
    }

Imagine having to add buffer-end and other validity checks to this. Furthermore, Unicode requires every character to be encoded with the smallest possible number of bytes, so that is another check, and the current mode has to be checked as well. Of course we have several macros similar to this one (for performance reasons), and there are code paths with built-in assumptions about valid UTF strings. This would increase complexity a lot: we would need many extra regression tests, a correct JIT implementation, and so on. It would also kill optimizations. For example, if you define a character range where every character is two bytes long, the JIT cleverly detects this and uses a fast path to discard any codepoint that is not two bytes long.
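
To give an idea of the difference, even the simplest two-byte branch would grow into something like this once buffer-end, continuation-byte and shortest-form checks are added (a rough sketch only, not a proposed patch and not how PCRE is actually structured):

/* Checked decode of a two-byte sequence; eptr points at the lead byte
   (already known to be 110xxxxx), end is one past the last valid byte. */
static int getutf8_checked_2byte(const unsigned char *eptr,
                                 const unsigned char *end,
                                 unsigned int *c)
{
  if (eptr + 1 >= end) return -1;            /* buffer end check */
  if ((eptr[1] & 0xc0) != 0x80) return -1;   /* continuation byte check */
  *c = ((eptr[0] & 0x1f) << 6) | (eptr[1] & 0x3f);
  if (*c < 0x80) return -1;                  /* overlong (non-shortest) form */
  return 2;                                  /* bytes consumed */
}

And that is only one of the five branches, repeated in every similar macro and in the JIT.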

The question is: who would be willing to do this work?

>That would chew up CPU resources unnecessarily, by requiring two passes 
>over the input, one for checking UTF-8, the other for doing the actual 
>match.  Granted, it might be faster in real-time than what we have now, 
>but overall it'd probably be more expensive (e.g., more energy 
>consumption) than what we have now, and this doesn't sound promising.

Yeah, but you could add a flag to enable this :) I feel this would be much less work than the former approach.
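
The two-pass idea would look roughly like this (is_valid_utf8() is a stand-in for any single-pass validator; it is not an existing PCRE or grep function):

#include <pcre.h>

extern int is_valid_utf8(const unsigned char *p, int len);  /* hypothetical */

static int exec_with_prepass(pcre *re, const char *buf, int len,
                             int *ovec, int ovecsize)
{
  /* Pass 1: validate the whole buffer once. */
  if (!is_valid_utf8((const unsigned char *)buf, len))
    return PCRE_ERROR_BADUTF8;   /* or whatever the flag defines as fallback */

  /* Pass 2: the subject is known to be valid, so the per-call check
     can be skipped safely. */
  return pcre_exec(re, NULL, buf, len, 0, PCRE_NO_UTF8_CHECK, ovec, ovecsize);
}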

>That doesn't sound like a win, I'm afraid.  The use case that prompted 
>this bug report is someone using 'grep -r' to search for strings like 
>'foobar' in binary data, and this use case would not work with this 
>suggested solution.

In this case, I would simply disable UTF-8 decoding. You could still search for UTF characters by giving their encoded byte sequences (i.e. searching for \xe9 as \xc3\xa9).
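
For example (again a sketch only, not grep code), searching for U+00E9 in byte mode just means spelling out its UTF-8 encoding in the pattern:

#include <pcre.h>

/* Compile WITHOUT PCRE_UTF8 and look for the raw bytes 0xC3 0xA9,
   i.e. the UTF-8 encoding of U+00E9.  Nothing is decoded, so invalid
   byte sequences elsewhere in the data cannot cause errors. */
static int find_e_acute(const char *buf, int len)
{
  const char *err; int erroff; int ovec[3];
  pcre *re = pcre_compile("\xc3\xa9", 0, &err, &erroff, NULL);
  if (re == NULL) return -1;
  int rc = pcre_exec(re, NULL, buf, len, 0, 0, ovec, 3);
  pcre_free(re);
  return rc;   /* >= 0: found; PCRE_ERROR_NOMATCH (-1): not found */
}

The price is that caseless matching, character classes and the like then work per byte rather than per character.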

Regards,
Zoltan







