GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales


Package: grep

Reported by: Vincent Lefevre <vincent <at> vinc17.net>

Date: Fri, 12 Sep 2014 01:26:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.



Message #180 received at 18454-done <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: 18454-done <at> debbugs.gnu.org
Cc: Santiago Ruano Rincón <santiago <at> debian.org>,
 Norihiro Tanaka <noritnk <at> kcn.ne.jp>, Jim Meyering <jim <at> meyering.net>,
 Vincent Lefevre <vincent <at> vinc17.net>,
 Zoltán Herczeg <hzmester <at> freemail.hu>,
 Eric Blake <eblake <at> redhat.com>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Tue, 23 Nov 2021 19:36:11 -0800

On 9/30/14 12:39, Paul Eggert wrote:

> GNU grep is smart 
> enough to start matching at character boundaries without checking the 
> validity of the input data.  This helps it run faster.  However, because 
> libpcre requires a validity prepass, grep -P must slow down and do the 
> validity check one way or another.  Grep does this only when libpcre is 
> used, and that's one reason grep -P is slower than plain grep.
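
As an illustration of the distinction drawn in the quoted text, here is a minimal sketch in C. It is not code from grep or libpcre; the function names (is_continuation, next_boundary, utf8_valid_prefix) are hypothetical and chosen only for this example. The first helper finds a character boundary by skipping continuation bytes without validating anything; the second is the kind of full validity prepass that has to inspect every byte.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static bool is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Cheap approach: starting at offset i, skip continuation bytes until
   reaching a byte that can begin a character.  No validity check. */
static size_t next_boundary(const unsigned char *buf, size_t len, size_t i)
{
    while (i < len && is_continuation(buf[i]))
        i++;
    return i;
}

/* Full prepass: return the offset of the first invalid sequence,
   or len if the whole buffer is valid UTF-8. */
static size_t utf8_valid_prefix(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t n;               /* continuation bytes expected */
        unsigned long cp, min;  /* decoded code point, smallest legal value */

        if (c < 0x80) { i++; continue; }            /* ASCII */
        else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; min = 0x10000; }
        else return i;          /* stray continuation or invalid lead byte */

        if (len - i <= n)
            return i;           /* sequence truncated at end of buffer */
        for (size_t k = 1; k <= n; k++) {
            if (!is_continuation(buf[i + k]))
                return i;
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        /* Reject overlong forms, surrogates, and values above U+10FFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return i;
        i += n + 1;
    }
    return len;
}
```

The boundary helper touches only the bytes near the starting offset, whereas the prepass decodes the entire buffer before any matching can begin, which is the extra cost described above.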

Now that Grep master on Savannah has been changed to use PCRE2 instead 
of PCRE, the 'grep -P' performance problem seems to have been fixed, in 
that the following commands now take about the same amount of time:

grep -P zzzyyyxxx 10840.pdf
pcre2grep -U zzzyyyxxx 10840.pdf

where the file is from <http://research.nhm.org/pdfs/10840/10840.pdf>. 
Formerly, 'grep -P' was about 10x slower on this test.

My guess is that the grep -P performance boost comes from bleeding-edge 
grep using PCRE2's PCRE2_MATCH_INVALID_UTF option.

I'm closing this old bug report <https://bugs.gnu.org/18454>. We can 
always reopen it if there are still performance issues that I've missed.




This bug report was last modified 3 years and 181 days ago.


