#18454 - Improve performance when -P (PCRE) is used in UTF-8 locales

GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales

Package: grep

Reported by: Vincent Lefevre <vincent <at> vinc17.net>

Date: Fri, 12 Sep 2014 01:26:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Message #137 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Fri, 28 Nov 2014 23:31:49 +0900

On Fri, 28 Nov 2014 03:59:18 +0100
Vincent Lefevre <vincent <at> vinc17.net> wrote:

> On binary files, it seems that testing the UTF-8 sequences in
> pcresearch.c is faster than asking pcre_exec to do that (because
> of the retry I assume); see attached patch. It actually checks
> UTF-8 only if an invalid sequence was already found by pcre_exec,
> assuming that pcre_exec can check the validity of a valid text
> file in a faster way.
>
> On some file similar to PDF (test 1):
>
>   Before: 1.77s
>   After:  1.38s
>
> But now, the main problem is the many pcre_exec. Indeed, if I replace
> the non-ASCII bytes by \n with:
>
>   LC_ALL=C tr \\200-\\377 \\n
>
> (now, one has a valid file but with many short lines), the grep -P time
> is 1.52s (test 2). And if I replace the non-ASCII bytes by null bytes
> with:
>
>   LC_ALL=C tr \\200-\\377 \\000
>
> the grep -P time is 0.30s (test 3), thus it is much faster.
>
> Note also that libpcre is much slower than normal grep on simple words,
> but on "a[0-9]b", it can be faster:
>
>            grep    PCRE    PCRE+patch
>   test 1   4.31    1.90    1.53
>   test 2   0.18    1.61    1.63
>   test 3   3.28    0.39    0.39
>
> With grep, I wonder why test 2 is much faster.
>
> --
> Vincent Lefevre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
> 100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
> Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Thanks for the patch. However, it seems that valid_utf() in PCRE also
takes 5- and 6-byte characters into account.

IMHO, we should assume that grep does not know how valid_utf() checks an
input text, although we do know that PCRE checks whether an input text is
valid UTF-8 or not. That way, grep keeps working even if PCRE changes the
behaviour of valid_utf().

If we do not check for invalid UTF-8 characters with valid_utf8() in
advance, grep may dump core with PCRE_NO_UTF8_CHECK. See
http://debbugs.gnu.org/cgi/bugreport.cgi?bug=16586

So we cannot avoid checking for invalid UTF-8 characters with
valid_utf8(). Furthermore, the check must be performed exactly as PCRE
expects, but grep does not know how PCRE checks for invalid UTF-8
characters, because of the assumption above.
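To make the concern concrete, here is a minimal sketch of the kind of stand-alone UTF-8 check that grep would have to carry if it validated input itself. It is neither the attached patch nor PCRE's valid_utf(); the function names utf8_lead_len and utf8_valid_prefix are hypothetical. The sketch assumes the RFC 3629 rules (at most 4 bytes per character, no overlong forms, no surrogates), while the lead-byte classifier still recognizes the older 5- and 6-byte forms so that they can be rejected explicitly. Any such copy of the rules lives outside PCRE and could drift from what PCRE expects, which is the risk described above.

```c
#include <stddef.h>

/* Hypothetical helper: sequence length implied by a lead byte under the
   original 1-6 byte UTF-8 scheme; 0 if the byte cannot start a sequence.  */
int
utf8_lead_len (unsigned char c)
{
  if (c < 0x80) return 1;
  if (c < 0xC0) return 0;   /* stray continuation byte */
  if (c < 0xE0) return 2;
  if (c < 0xF0) return 3;
  if (c < 0xF8) return 4;
  if (c < 0xFC) return 5;   /* old 5-byte form, not allowed by RFC 3629 */
  if (c < 0xFE) return 6;   /* old 6-byte form, not allowed by RFC 3629 */
  return 0;                 /* 0xFE and 0xFF never occur in UTF-8 */
}

/* Hypothetical helper: number of leading bytes of BUF (of size N) that are
   valid UTF-8 under the RFC 3629 rules assumed here.  The result equals N
   when the whole buffer is valid.  */
size_t
utf8_valid_prefix (const char *buf, size_t n)
{
  /* Smallest code point that may be encoded with a given length.  */
  static const unsigned long min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
  size_t i = 0;

  while (i < n)
    {
      unsigned char c = (unsigned char) buf[i];
      int len = utf8_lead_len (c);
      unsigned long cp;
      int k;

      if (len == 1)
        {
          i++;
          continue;
        }
      /* Reject bad lead bytes, 5/6-byte forms and truncated sequences.  */
      if (len == 0 || len > 4 || n - i < (size_t) len)
        break;

      cp = c & (0xFF >> (len + 1));   /* payload bits of the lead byte */
      for (k = 1; k < len; k++)
        {
          unsigned char cc = (unsigned char) buf[i + k];
          if ((cc & 0xC0) != 0x80)
            break;
          cp = (cp << 6) | (cc & 0x3F);
        }
      if (k < len)
        break;                        /* missing continuation byte */
      if (cp < min_cp[len] || cp > 0x10FFFF
          || (cp >= 0xD800 && cp <= 0xDFFF))
        break;                        /* overlong, out of range, surrogate */
      i += len;
    }
  return i;
}
```

A caller would compare the return value with the buffer size and only treat the buffer as safe for PCRE_NO_UTF8_CHECK when the two are equal. Whether these rules match PCRE's own check byte for byte is exactly what grep cannot guarantee.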

This bug report was last modified 3 years and 231 days ago.

