GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales


Package: grep;

Reported by: Vincent Lefevre <vincent <at> vinc17.net>

Date: Fri, 12 Sep 2014 01:26:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.


From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Zoltán Herczeg <hzmester <at> freemail.hu>
Cc: 18454 <at> debbugs.gnu.org
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Sun, 28 Sep 2014 08:09:33 -0700
Zoltán Herczeg wrote:

> For me the question is whether binary search needs to be supported on PCRE level.

It's purely a performance question.  GNU grep already uses libpcre to search 
binary data, and it works now.  It's just slow, that's all.  I'm willing to live
with this, and tell users "Sorry, but libpcre is not designed to search binary 
data quickly; if you want speed then don't use grep's -P option."  If you're 
willing to live with this too, we're done.

> removing a lot of optimizations.

You shouldn't need to remove any optimizations for the PCRE_NO_UTF8_CHECK case.
Keep them all.  It should be just as fast as before.  The idea is to have one
matcher for the PCRE_NO_UTF8_CHECK case (one that works much as now) and another 
matcher for the non-PCRE_NO_UTF8_CHECK case (one that checks validity as it 
goes).  The former matcher will be just as fast as now, and the latter matcher 
will be faster than what libpcre has now.  I readily concede that this will 
require some nontrivial coding, but I don't concede that it will remove 
optimizations or make libpcre slower.  It should make libpcre faster; that's the 
point.
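
To make "checks validity as it goes" concrete, here is a minimal sketch of a
decoder that validates each character at the moment it is consumed, so no
separate prepass over the buffer is needed.  The name utf8_decode_checked and
its interface are hypothetical; nothing like this is claimed to exist in
libpcre, and a real matcher would fold such checks into its own inner loops.

```c
#include <stddef.h>

/* Hypothetical helper, not part of libpcre.  Decode the UTF-8 character
   starting at P (requires P < END), validating it on the fly.  Store the
   code point in *CP and return the number of bytes consumed (1 to 4),
   or return 0 if the bytes at P are not a valid UTF-8 character.  */
size_t
utf8_decode_checked (const unsigned char *p, const unsigned char *end,
                     unsigned long *cp)
{
  unsigned char b0 = p[0];
  size_t len, i;
  unsigned long c, min;

  if (b0 < 0x80)
    {
      *cp = b0;                 /* ASCII fast path */
      return 1;
    }
  else if (b0 >= 0xC2 && b0 <= 0xDF)
    len = 2, c = b0 & 0x1F, min = 0x80;
  else if ((b0 & 0xF0) == 0xE0)
    len = 3, c = b0 & 0x0F, min = 0x800;
  else if (b0 >= 0xF0 && b0 <= 0xF4)
    len = 4, c = b0 & 0x07, min = 0x10000;
  else
    return 0;                   /* stray continuation or invalid lead byte */

  if ((size_t) (end - p) < len)
    return 0;                   /* character truncated by end of buffer */

  for (i = 1; i < len; i++)
    {
      if ((p[i] & 0xC0) != 0x80)
        return 0;               /* expected a continuation byte */
      c = (c << 6) | (p[i] & 0x3F);
    }

  /* Reject overlong forms, surrogates, and values past U+10FFFF.  */
  if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    return 0;

  *cp = c;
  return len;
}
```

A matcher built this way only pays for validation on the bytes it actually
examines, which is the source of the expected speedup over a full prepass.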

> You have a 100 byte long buffer, and you start matching from byte 50.

Grep already does that sort of thing.  And it's smart enough to start matching 
only at character boundaries.  It's not libpcre's job to worry about this; the 
caller can worry about it.
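
As an illustration of the caller handling this, here is a rough sketch of how
a start offset such as byte 50 could be moved to a character boundary before
the matcher is invoked.  The function align_to_char_start is a made-up name
for this example and is not taken from grep's source; it assumes the buffer
holds valid UTF-8 and that OFF <= LEN.

```c
#include <stddef.h>

/* Hypothetical caller-side helper, not grep's actual code.  Advance OFF
   past UTF-8 continuation bytes (10xxxxxx) so that it lands on the first
   byte of a character.  At most 3 bytes are skipped, since a valid UTF-8
   character has at most 3 continuation bytes.  Requires OFF <= LEN.  */
size_t
align_to_char_start (const unsigned char *buf, size_t len, size_t off)
{
  size_t limit = off + 3 < len ? off + 3 : len;

  while (off < limit && (buf[off] & 0xC0) == 0x80)
    off++;
  return off;
}
```

The matcher can then be handed the adjusted offset and never has to consider
a start position in the middle of a character.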

> For me this is way too many checks, and affects compiler optimizations too much.

The code you posted could be made faster than that; among other things there 
should not be an unbounded backward scan.  And even the code you posted would 
often be faster than what's in libpcre now.  That early UTF-8 validity prepass 
is a killer.
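
For what a bounded backward scan might look like, here is a small sketch.  It
is not the code under discussion; the name char_start_before and the
(size_t) -1 failure value are assumptions made for this example.  The point
is only that the scan can give up after 3 bytes, because no valid UTF-8
character has more than 3 continuation bytes.

```c
#include <stddef.h>

/* Hypothetical helper.  Return the index of the nearest byte at or before
   POS that is not a UTF-8 continuation byte (10xxxxxx), looking back at
   most 3 bytes.  Return (size_t) -1 if no such byte is found within that
   bound, which means the data around POS is not valid UTF-8.  */
size_t
char_start_before (const unsigned char *buf, size_t pos)
{
  size_t back;

  for (back = 0; back <= 3 && back <= pos; back++)
    if ((buf[pos - back] & 0xC0) != 0x80)
      return pos - back;
  return (size_t) -1;
}
```

Because the loop runs at most four times, its cost is constant regardless of
how much invalid data precedes POS.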




