GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
With the patch that fixes bug 18266, grep -P works again on binary
files (with invalid UTF-8 sequences), but it is now significantly
slower than old versions (which could yield undefined behavior).
Timings with the Debian packages on my personal svn working copy
(binary + text files):
2.18-2 0.9s with -P, 0.4s without -P
2.20-3 11.6s with -P, 0.4s without -P
In this example, that's a 13x slowdown with -P! Although the performance
issue would be better fixed in libpcre3, I suppose that this is not so
simple and won't happen any time soon. Several things could be done in grep:
1. Ignore -P when the pattern would have the same meaning without -P
(patterns could also be transformed, e.g. "a\d+b" -> "a[0-9]\+b",
at least for the simplest cases).
2. Call PCRE in the C locale when this is equivalent.
3. Transform invalid bytes to null bytes in-place before the PCRE
call. This changes the current semantics, but:
* the semantics on invalid bytes have never been specified, AFAIK;
* the best *practical* behavior may not be the current one
(I personally prefer to be able to match invalid bytes, just
like one can match top-bit-set characters in the C locale, and
seeing such invalid bytes as equivalent to null bytes would
not be a problem for most users, IMHO -- this could also be made
configurable).
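
Option 3 above can be illustrated with a small sketch. It is written in
Python for brevity, is not grep's code, and is not necessarily how an
actual patch would proceed. The helper name nul_out_invalid_utf8 is
hypothetical, and the sketch relies on Python's "surrogateescape" error
handler to locate the bytes that a strict UTF-8 decoder rejects. It shows
that the buffer keeps its length, so byte offsets reported by a later
match would still line up with the original data.

```python
def nul_out_invalid_utf8(buf: bytearray) -> int:
    """Overwrite every byte of buf that is not part of a valid UTF-8
    sequence with a NUL byte, keeping the buffer length unchanged.

    Returns the number of bytes overwritten. (Hypothetical helper,
    for illustration only.)
    """
    # "surrogateescape" maps each undecodable byte 0x80..0xFF to one lone
    # surrogate U+DC80..U+DCFF; valid sequences decode normally.
    text = bytes(buf).decode("utf-8", errors="surrogateescape")

    replaced = 0
    chars = []
    for ch in text:
        if "\udc80" <= ch <= "\udcff":
            # One escaped (invalid) byte becomes exactly one NUL byte.
            chars.append("\0")
            replaced += 1
        else:
            chars.append(ch)

    cleaned = "".join(chars).encode("utf-8")
    # Valid characters re-encode to the same bytes, so the size is stable.
    assert len(cleaned) == len(buf)
    buf[:] = cleaned
    return replaced


if __name__ == "__main__":
    # Valid "caf\xc3\xa9", two stray bytes, and a truncated 3-byte sequence.
    data = bytearray(b"caf\xc3\xa9 \xff\xfe ok \xe2\x82")
    count = nul_out_invalid_utf8(data)
    print(count, bytes(data))
```

After such a pass the buffer is valid UTF-8, so in principle it could be
handed to PCRE with the validity check disabled (PCRE_NO_UTF8_CHECK in
libpcre); whether that recovers the lost time is not measured here. Note
that NUL bytes already present in the input become indistinguishable from
the replaced ones, which is the change in semantics mentioned above.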
--
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
This bug report was last modified 3 years and 181 days ago.