GNU bug report logs -
#18454
Improve performance when -P (PCRE) is used in UTF-8 locales
On binary files, it seems that testing the UTF-8 sequences in
pcresearch.c is faster than asking pcre_exec to do that (because
of the retry, I assume); see the attached patch. It actually checks
the UTF-8 only if an invalid sequence has already been found by
pcre_exec, on the assumption that pcre_exec can check the validity
of a valid text file faster.
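To make the idea concrete, here is a minimal, self-contained sketch of
such a UTF-8 validity check. This is not the code from the attached
patch; the function name and interface are hypothetical. When a scan
like this succeeds, one could in principle tell pcre_exec to skip its
own check (via PCRE_NO_UTF8_CHECK); when it fails, grep must decide
itself how to handle the invalid bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not the attached patch): returns 1 if
 * buf[0..len) is valid UTF-8, 0 otherwise. */
static int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t n;      /* number of continuation bytes expected */
        uint32_t cp;   /* decoded code point */

        if (c < 0x80) { i++; continue; }          /* ASCII fast path */
        else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; }
        else return 0;   /* stray continuation byte or invalid lead */

        if (i + n >= len)
            return 0;    /* truncated sequence at end of buffer */
        for (size_t j = 1; j <= n; j++) {
            if ((buf[i + j] & 0xC0) != 0x80)
                return 0;                  /* bad continuation byte */
            cp = (cp << 6) | (buf[i + j] & 0x3F);
        }
        /* Reject overlong forms, UTF-16 surrogates, and > U+10FFFF. */
        if ((n == 1 && cp < 0x80) || (n == 2 && cp < 0x800)
            || (n == 3 && cp < 0x10000)
            || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return 0;
        i += n + 1;
    }
    return 1;
}
```

A check like this is a single linear pass per buffer, whereas letting
pcre_exec discover invalid sequences can cost a retry per bad byte.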
On a file similar to a PDF (test 1):
Before: 1.77s
After: 1.38s
But now, the main problem is the many pcre_exec calls. Indeed, if I
replace the non-ASCII bytes by \n with:
LC_ALL=C tr \\200-\\377 \\n
(one then has a valid file, but with many short lines), the grep -P
time is 1.52s (test 2). And if I replace the non-ASCII bytes by null
bytes with:
LC_ALL=C tr \\200-\\377 \\000
the grep -P time is 0.30s (test 3), which is much faster.
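For reference, the two transformed inputs can be built as below. The
file names are placeholders, and the snippet fabricates a tiny stand-in
for the real binary file, so it does not reproduce the timings above.

```shell
# Tiny stand-in for the binary test file (the real file is not part of
# this report); byte \200 plays the role of a non-ASCII byte.
printf 'a1b\200a2b\200' > file.bin

# Test 2 input: map every non-ASCII byte to a newline ->
# a valid UTF-8 file made of many short lines.
LC_ALL=C tr '\200-\377' '\n' < file.bin > test2.txt

# Test 3 input: map every non-ASCII byte to a NUL byte instead.
LC_ALL=C tr '\200-\377' '\000' < file.bin > test3.bin

# Each variant would then be timed with something like:
#   time LC_ALL=en_US.UTF-8 grep -cP 'a[0-9]b' test2.txt
```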
Note also that libpcre is much slower than normal grep on simple words,
but on "a[0-9]b", it can be faster:
           grep    PCRE    PCRE+patch
test 1     4.31    1.90    1.53
test 2     0.18    1.61    1.63
test 3     3.28    0.39    0.39
With grep, I wonder why test 2 is much faster.
--
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
[grep221-pcresearch.patch (text/plain, attachment)]