GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales


Package: grep

Reported by: Vincent Lefevre <vincent <at> vinc17.net>

Date: Fri, 12 Sep 2014 01:26:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>




Message #74 received at submit <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Zoltán Herczeg <hzmester <at> freemail.hu>
Cc: bug-grep <at> gnu.org, 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Fri, 26 Sep 2014 01:48:00 -0700

Zoltán Herczeg wrote:

> Just consider these two examples, where \x9c is an incorrectly encoded Unicode codepoint:
>
> /(?<=\x9c)#/
>
> Does it match \xd5\x9c# starting from #?

No, because the input does not contain a \x9c encoding error: \xd5\x9c is a 
valid two-byte UTF-8 sequence encoding U+055C, so that \x9c is merely the 
trailing byte of a valid character.  Encoding errors match only themselves, 
not parts of other characters.  That is how the glibc matchers behave, and 
it's what users expect.
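
A quick way to see this (my illustration, assuming a glibc system with an 
en_US.UTF-8 locale installed) is to ask mbrtowc: it decodes \xd5\x9c as the 
single character U+055C, while a bare \x9c is reported as an encoding error.

/* Sketch: \xd5\x9c decodes as one valid character, while a bare \x9c
   is an encoding error.  Assumes glibc and an installed en_US.UTF-8
   locale. */
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

static void
decode (const char *label, const char *s, size_t len)
{
  mbstate_t st;
  wchar_t wc;
  memset (&st, 0, sizeof st);
  size_t n = mbrtowc (&wc, s, len, &st);
  if (n == (size_t) -1)
    printf ("%s: encoding error at the first byte\n", label);
  else
    printf ("%s: leading %zu-byte character U+%04lX\n",
            label, n, (unsigned long) wc);
}

int
main (void)
{
  setlocale (LC_ALL, "en_US.UTF-8");
  decode ("\\xd5\\x9c#", "\xd5\x9c#", 3);   /* U+055C; \x9c is a trail byte */
  decode ("\\x9c#", "\x9c#", 2);            /* bare \x9c: encoding error */
  return 0;
}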

> Noticing errors during a backward scan is complicated.

It's doable, and it's the right thing to do.
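
For instance (a sketch of mine, not code from PCRE or glibc; overlong-form 
and surrogate checks omitted), classifying a byte during a backward scan 
needs to look back at most three bytes: back up over continuation bytes to a 
candidate lead byte, then re-decode forward.

/* Caller guarantees start <= p < end.  The byte at p is an encoding
   error exactly when no valid sequence covers it. */
static int
utf8_seq_len (unsigned char lead)       /* 0 if not a lead byte */
{
  if (lead < 0x80) return 1;
  if ((lead & 0xe0) == 0xc0) return 2;
  if ((lead & 0xf0) == 0xe0) return 3;
  if ((lead & 0xf8) == 0xf0) return 4;
  return 0;
}

static int
is_error_byte (const unsigned char *start, const unsigned char *p,
               const unsigned char *end)
{
  const unsigned char *q = p;
  for (int back = 0; back < 3 && q > start && (*q & 0xc0) == 0x80; back++)
    q--;                                /* skip continuation bytes */
  int n = utf8_seq_len (*q);
  if (n == 0 || q + n <= p || end - q < n)
    return 1;                           /* no lead byte reaches p */
  for (int k = 1; k < n; k++)           /* all continuations present? */
    if ((q[k] & 0xc0) != 0x80)
      return 1;
  return 0;
}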

> /[\x9c-\x{ffff}]/
>
> What does this range define, exactly?

Range expressions have implementation-defined semantics in POSIX.  For PCRE you 
can do what you like.  I suggest mapping encoding-error bytes into characters 
outside the Unicode range; that's what Emacs does, I think, and it's simple and 
easy to explain to users.  It's not a big deal either way.
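
As a sketch of that suggestion (mine, not Emacs's or PCRE's actual 
representation; the 0x110000 + b offset is just one plausible choice), each 
encoding-error byte b becomes a pseudo-character past the Unicode maximum, 
and a bracket range reduces to a numeric comparison.

/* Hypothetical representation: each encoding-error byte b becomes the
   pseudo-character 0x110000 + b, just past the Unicode maximum
   U+10FFFF. */
typedef unsigned int xchar;     /* Unicode scalar or pseudo-character */

static xchar
map_error_byte (unsigned char b)
{
  return 0x110000 + b;          /* outside U+0000..U+10FFFF */
}

/* A bracket range is then a plain comparison.  In UTF-8 mode
   [\x9c-\x{ffff}] means U+009C..U+FFFF, so no encoding-error byte can
   fall inside it under this mapping. */
static int
in_range (xchar c, xchar lo, xchar hi)
{
  return lo <= c && c <= hi;
}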

> What kind of invalid and valid UTF-8 byte sequences are inside (and outside) the bounds?

Just treat encoding-error bytes like everything else.  In effect, extend the 
encoding to allow any byte sequence, and add a few "characters" outside the 
Unicode range, one for each invalid UTF-8 byte.
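
Continuing that sketch, with the same hypothetical mapping as above (and 
with overlong-form and surrogate checks again omitted for brevity), here is 
what "extend the encoding" could look like as a decoder: each step yields 
either a Unicode scalar value or exactly one pseudo-character per 
undecodable byte, so any byte sequence whatsoever decodes to a character 
sequence.

#include <stddef.h>

typedef unsigned int xchar;     /* repeated so the sketch stands alone */

/* Decode the next character from p (caller guarantees p < end),
   returning its byte length through *len. */
static xchar
next_char (const unsigned char *p, const unsigned char *end, size_t *len)
{
  static const struct { unsigned char mask, tag; int trail; } lead[] = {
    { 0x80, 0x00, 0 },          /* 0xxxxxxx: ASCII */
    { 0xe0, 0xc0, 1 },          /* 110xxxxx: 2-byte sequence */
    { 0xf0, 0xe0, 2 },          /* 1110xxxx: 3-byte sequence */
    { 0xf8, 0xf0, 3 },          /* 11110xxx: 4-byte sequence */
  };
  for (int i = 0; i < 4; i++)
    if ((p[0] & lead[i].mask) == lead[i].tag)
      {
        xchar c = p[0] & ~lead[i].mask;
        if (end - p <= lead[i].trail)
          break;                /* truncated at end of buffer */
        for (int k = 1; k <= lead[i].trail; k++)
          {
            if ((p[k] & 0xc0) != 0x80)
              goto bad;         /* missing continuation byte */
            c = (c << 6) | (p[k] & 0x3f);
          }
        *len = lead[i].trail + 1;
        return c;
      }
 bad:
  *len = 1;
  return 0x110000 + p[0];       /* one pseudo-character per error byte */
}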

> Caseless matching is another question: does /\xe9/ match \xc3\x89, or the invalid UTF-8 byte sequence \xc9?

Sorry, I don't quite follow, but encoding errors aren't letters and don't have 
case.  They match only themselves.

> What Unicode properties does an invalid codepoint have?

The minimal ones.

> Depending on their needs, everybody has different answers to these questions.

That's fine.  Just implement reasonable defaults, and provide options if people 
have needs that differ from the defaults.  That's easier for libpcre than for 
grep, since libpcre users (who are programmers) can reasonably be expected to be 
more sophisticated about this sort of thing than grep users (who are not 
necessarily programmers).

> Imagine if you needed to add buffer-end and other bit checks.

Of course it will be more expensive to check for UTF-8 as you go than to 
assume the input is valid UTF-8.  But again, we're not talking about the 
PCRE_NO_UTF8_CHECK case where libpcre can assume valid UTF-8; we're talking 
about the non-PCRE_NO_UTF8_CHECK case, where libpcre must check whether the 
input is valid UTF-8, and currently does so inefficiently.  In the 
non-PCRE_NO_UTF8_CHECK case, it's often cheaper to check for UTF-8 as you go 
than to have a prepass that checks for UTF-8.  This is because the prepass 
must be stupid (it must check the entire input buffer) whereas the matcher 
can be smart (it can often do its work without checking the entire input 
buffer).  This is one reason libpcre is slower than the glibc matchers.
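
To make the cost concrete, here is a sketch (mine, not grep's or libpcre's 
code) of the caller-side workaround the current API forces: validate each 
input buffer once, however the caller likes, then pass PCRE_NO_UTF8_CHECK on 
every per-line pcre_exec call so libpcre does not rescan the same bytes.

/* Assumes the PCRE 8.x API and a pattern compiled with PCRE_UTF8.
   buf_is_valid_utf8 is the caller's one-time verdict on the whole
   buffer. */
#include <pcre.h>
#include <string.h>

static void
match_lines (pcre *re, const char *buf, int buflen, int buf_is_valid_utf8)
{
  int options = buf_is_valid_utf8 ? PCRE_NO_UTF8_CHECK : 0;
  int ovector[30];

  for (int pos = 0; pos < buflen; )
    {
      const char *nl = memchr (buf + pos, '\n', buflen - pos);
      int eol = nl ? (int) (nl - buf) : buflen;
      int rc = pcre_exec (re, NULL, buf + pos, eol - pos, 0,
                          options, ovector, 30);
      if (rc >= 0)
        {
          /* Match in this line at buf + pos + ovector[0]; report it. */
        }
      else if (rc != PCRE_ERROR_NOMATCH)
        {
          /* PCRE_ERROR_BADUTF8 etc.; handle as the application sees fit. */
        }
      pos = eol + 1;
    }
}

The prepass asymmetry is visible here: without PCRE_NO_UTF8_CHECK, every 
pcre_exec call re-validates its whole subject up front, even when the match 
is decided by the first few bytes.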

Obviously it would be some work to build a libpcre that runs faster in the 
non-PCRE_NO_UTF8_CHECK case, without hurting performance in the 
PCRE_NO_UTF8_CHECK case.  But it could be done, if someone had the time to do it.

> The question is, who would be willing to do this work.

Not me.  :-)

>> That would chew up CPU resources unnecessarily

> Yeah, but you could add a flag to enable this :)

I'm not sure it'd be popular to add a --drain-battery option to grep. :)

>> The use case that prompted
>> this bug report is someone using 'grep -r' to search for strings like
>> 'foobar' in binary data, and this use case would not work with this
>> suggested solution.
>
> In this case, I would simply disable UTF-8 decoding.

I suggested that already, but the user (e.g., see the last paragraph of 
<http://bugs.gnu.org/18454#19>) says he wants to check for more-complicated 
UTF-8 patterns in binary data.  For example, I expect the user wants the pattern 
'Lef.vre' to match the UTF-8 string 'Lefèvre' in a binary file.  So he can't 
simply use unibyte processing.






