#18454 - Improve performance when -P (PCRE) is used in UTF-8 locales

GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales

Package: grep

Reported by: Vincent Lefevre <vincent <at> vinc17.net>

Date: Fri, 12 Sep 2014 01:26:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.


From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Zoltán Herczeg <hzmester <at> freemail.hu>
Cc: 18454 <at> debbugs.gnu.org
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Sun, 28 Sep 2014 08:09:33 -0700

Zoltán Herczeg wrote:
> For me the question is whether binary search needs to supported on PCRE level.

It's purely a performance question. GNU grep already uses libpcre to search binary data, and it works now. It's just slow, that's all. I'm willing to live with this, and tell users "Sorry, but libpcre is not designed to search binary data quickly; if you want speed then don't use grep's -P option." If you're willing to live with this too, we're done.

> removing a lot of optimizations.

You shouldn't need to remove any optimizations for the PCRE_NO_UTF8_CHECK case. Keep them all. It should be just as fast as before. The idea is to have one matcher for the PCRE_NO_UTF8_CHECK case (one that works much as now) and another matcher for the non-PCRE_NO_UTF8_CHECK case (one that checks validity as it goes). The former matcher will be just as fast as now, and the latter matcher will be faster than what libpcre has now. I readily concede that this will require some nontrivial coding, but I don't concede that it will remove optimizations or make libpcre slower. It should make libpcre faster; that's the point.

> You have a 100 byte long buffer, and you start matching from byte 50.

Grep already does that sort of thing. And it's smart enough to start matching only at character boundaries. It's not libpcre's job to worry about this; the caller can worry about it.

> For me this is way too much checks, and affects compiler optimizations too much.

The code you posted could be made faster than that; among other things there should not be an unbounded backward scan. And even the code you posted would often be faster than what's in libpcre now. That early UTF-8 validity prepass is a killer.
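To illustrate the point about character boundaries and bounded backward scans, here is a minimal sketch in C. It is not code from grep or libpcre; the helper names utf8_seq_len and utf8_char_start are hypothetical. It assumes only that a UTF-8 sequence is at most 4 bytes long, so finding the start of the character that contains a given offset never needs to look back more than 3 bytes. It does not fully validate the data (overlong or truncated sequences are not rejected).

```c
#include <stddef.h>

/* Hypothetical helper: length of the sequence introduced by lead byte b,
   or 0 if b is a continuation byte or cannot start a sequence. */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;              /* 0xxxxxxx: ASCII */
    if ((b & 0xE0) == 0xC0) return 2;    /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;    /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;    /* 11110xxx */
    return 0;
}

/* Hypothetical helper: return the offset of the first byte of the
   character containing buf[pos]. The scan is bounded: it inspects at
   most 3 bytes before pos. If no lead byte covering pos is found,
   pos itself is returned, so a stray byte is treated as its own
   boundary and left for the caller to handle. */
static size_t utf8_char_start(const unsigned char *buf, size_t pos)
{
    size_t limit = pos < 3 ? pos : 3;
    for (size_t back = 0; back <= limit; back++) {
        unsigned char b = buf[pos - back];
        if ((b & 0xC0) != 0x80) {
            /* Non-continuation byte: it owns pos only if its
               sequence is long enough to reach pos. */
            return utf8_seq_len(b) > back ? pos - back : pos;
        }
    }
    return pos; /* four continuation bytes in a row: invalid input */
}
```

For example, with the bytes 0x61 0xC3 0xA9 0x62 ("a", "é", "b"), calling utf8_char_start at offset 2 returns 1, the start of the two-byte "é". A caller that wants to begin matching in the middle of a buffer could adjust its start offset this way before handing the data to the matcher.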

This bug report was last modified 3 years and 231 days ago.

