GNU bug report logs -
#18454
Improve performance when -P (PCRE) is used in UTF-8 locales
Zoltán Herczeg wrote:
> For me the question is whether binary search needs to be supported on PCRE level.
It's purely a performance question. GNU grep already uses libpcre to search
binary data, and it works now. It's just slow, that's all. I'm willing to live
with this, and tell users "Sorry, but libpcre is not designed to search binary
data quickly; if you want speed then don't use grep's -P option." If you're
willing to live with this too, we're done.
> removing a lot of optimizations.
You shouldn't need to remove any optimizations for the PCRE_NO_UTF8_CHECK case.
Keep them all. It should be just as fast as before. The idea is to have one
matcher for the PCRE_NO_UTF8_CHECK case (one that works much as now) and another
matcher for the non-PCRE_NO_UTF8_CHECK case (one that checks validity as it
goes). The former matcher will be just as fast as now, and the latter matcher
will be faster than what libpcre has now. I readily concede that this will
require some nontrivial coding, but I don't concede that it will remove
optimizations or make libpcre slower. It should make libpcre faster; that's the
point.
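To make the two-matcher idea concrete, here is a minimal sketch of the two
kinds of character fetch such matchers might use. It is not libpcre code; the
names decode_unchecked and decode_checked are hypothetical, and a real matcher
would inline this logic rather than call a function per character.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fast path (the PCRE_NO_UTF8_CHECK style of matcher):
   trusts that p points at the start of a valid UTF-8 character.
   Returns the number of bytes consumed and stores the code point in *cp. */
static size_t decode_unchecked(const unsigned char *p, uint32_t *cp)
{
    unsigned char b0 = p[0];
    if (b0 < 0x80) { *cp = b0; return 1; }
    if (b0 < 0xE0) { *cp = ((b0 & 0x1Fu) << 6) | (p[1] & 0x3Fu); return 2; }
    if (b0 < 0xF0) {
        *cp = ((b0 & 0x0Fu) << 12) | ((p[1] & 0x3Fu) << 6) | (p[2] & 0x3Fu);
        return 3;
    }
    *cp = ((b0 & 0x07u) << 18) | ((p[1] & 0x3Fu) << 12)
        | ((p[2] & 0x3Fu) << 6) | (p[3] & 0x3Fu);
    return 4;
}

/* Hypothetical checking path: validates the character as it decodes it,
   so no separate validity prepass over the whole subject is needed.
   Returns the number of bytes consumed (1..4), or 0 if the bytes at p
   are not a valid UTF-8 character. end is one past the last byte. */
static size_t decode_checked(const unsigned char *p, const unsigned char *end,
                             uint32_t *cp)
{
    if (p >= end) return 0;
    unsigned char b0 = p[0];
    if (b0 < 0x80) { *cp = b0; return 1; }

    size_t len;
    uint32_t c, min;
    if ((b0 & 0xE0) == 0xC0)      { len = 2; c = b0 & 0x1Fu; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; c = b0 & 0x0Fu; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; c = b0 & 0x07u; min = 0x10000; }
    else return 0;                      /* stray continuation or bad lead byte */

    if ((size_t)(end - p) < len) return 0;          /* truncated character */
    for (size_t i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0;        /* not a continuation byte */
        c = (c << 6) | (p[i] & 0x3Fu);
    }
    if (c < min) return 0;                          /* overlong encoding */
    if (c > 0x10FFFF) return 0;                     /* beyond Unicode range */
    if (c >= 0xD800 && c <= 0xDFFF) return 0;       /* surrogate */
    *cp = c;
    return len;
}
```

The checking path pays only for the characters the matcher actually visits,
whereas a prepass pays for every byte of the subject before matching starts.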
> You have a 100 byte long buffer, and you start matching from byte 50.
Grep already does that sort of thing. And it's smart enough to start matching
only at character boundaries. It's not libpcre's job to worry about this; the
caller can worry about it.
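As an illustration of what "the caller can worry about it" might look like,
here is a small sketch of a caller-side helper that moves a start offset
forward to the next character boundary. This is not grep's actual code; the
name next_char_boundary is hypothetical, and the sketch assumes the caller
already knows the buffer is valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical caller-side helper: given a byte offset into buf, return the
   nearest offset at or after it that is not in the middle of a character.
   UTF-8 continuation bytes have the form 10xxxxxx, so skip over them.
   In valid UTF-8 this loop runs at most three times. */
static size_t next_char_boundary(const unsigned char *buf, size_t len,
                                 size_t off)
{
    while (off < len && (buf[off] & 0xC0) == 0x80)
        off++;
    return off;
}
```

The caller would then pass the adjusted offset as the start offset of the
match (the startoffset argument of pcre_exec), so the library never sees a
start position inside a multibyte character.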
> For me this is way too many checks, and affects compiler optimizations too much.
The code you posted could be made faster than that; among other things there
should not be an unbounded backward scan. And even the code you posted would
often be faster than what's in libpcre now. That early UTF-8 validity prepass
is a killer.
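For what a bounded backward scan could look like, here is a rough sketch. It
is not the code under discussion and not libpcre code; char_start_bounded and
NO_CHAR_START are hypothetical names. Since a UTF-8 character is at most four
bytes long, the scan never needs to look more than three bytes behind the
current position.

```c
#include <stddef.h>

#define NO_CHAR_START ((size_t)-1)   /* hypothetical "invalid" marker */

/* Hypothetical helper: find the start of the character that contains
   buf[pos], looking back at most 3 bytes. The caller must ensure pos is a
   valid index into buf. Returns the start offset, or NO_CHAR_START if no
   lead byte within reach could cover pos. It does not validate the bytes
   that follow pos. */
static size_t char_start_bounded(const unsigned char *buf, size_t pos)
{
    for (size_t back = 0; back <= 3 && back <= pos; back++) {
        unsigned char b = buf[pos - back];
        if ((b & 0xC0) != 0x80) {
            /* Not a continuation byte: work out how long a character
               beginning with this byte claims to be (0 means invalid). */
            size_t need = b < 0x80           ? 1
                        : (b & 0xE0) == 0xC0 ? 2
                        : (b & 0xF0) == 0xE0 ? 3
                        : (b & 0xF8) == 0xF0 ? 4
                        : 0;
            /* The character covers pos only if it is long enough. */
            return back < need ? pos - back : NO_CHAR_START;
        }
    }
    return NO_CHAR_START;   /* more than 3 continuation bytes in a row */
}
```

The cost per call is constant no matter how much invalid data precedes the
position, which is the property an unbounded scan lacks.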
This bug report was last modified 3 years and 181 days ago.