#18454 - Improve performance when -P (PCRE) is used in UTF-8 locales

GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales

Package: grep

Reported by: Vincent Lefevre <vincent <at> vinc17.net>

Date: Fri, 12 Sep 2014 01:26:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Message #22 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Fri, 12 Sep 2014 17:59:41 -0700

Vincent Lefevre wrote:
> This is still better than no optimization at all.

We'd have to see; not every optimization is worth the trouble.

> if the behavior is chosen by an option, the user would be aware
> of the meaning of the output, so that this won't really matter.

It'd be better if there wasn't a new grep option simply to avoid a libpcre performance bug.

> Could you give some reference?

The pcreunicode man page mentions some of this issue under "Validity of UTF-8 string". My impression is that the actual history of behavior changes is more complicated than what that simple summary would suggest.

> This doesn't introduce undefined behavior, just a different
> behavior

Again, it'd be better if grep Just Worked.

> I suppose that this is due
> to the many retries from the pcresearch.c code on binary files (the
> line is split into many sublines, many often consisting of a single
> byte), i.e. the problem is on the grep side.

libpcre is not giving 'grep' an efficient way to search data that can contain encoding errors. This does not mean "the problem is on the grep side".

> I don't see how this
> could be solved except by doing the UTF-8 check on the grep side.

There's another way: fix libpcre so that it works on arbitrary binary data, without the need for prescreening the data. That's the fundamental problem here.

>>> I often want to take binary files into account
>>
>> In those cases I suggest using a unibyte C locale.
>
> I still want "." to match a single (valid) UTF-8 character.

How about this idea instead? Use a unibyte C locale, and write a unibyte regular expression C that matches a single valid UTF-8 character (using whatever definition you like for UTF-8). Then, you can use . to match single bytes and C to match characters. This gives you all the power you need, without the slowdown due to UTF-8 processing, a slowdown that will be inevitable no matter how we change grep or libpcre.
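
The suggestion in the last paragraph can be illustrated with a byte-oriented regular expression. The following is only a rough sketch in Python rather than grep, using the standard `re` module on `bytes`, which behaves like a unibyte matcher. The names `UTF8_CHAR` (standing in for "C" in the message) and `find_between` are hypothetical, and the byte ranges assume the RFC 3629 definition of well-formed UTF-8; the message explicitly leaves that choice to the user.

```python
import re

# Hypothetical stand-in for the expression "C" in the message: a byte-level
# pattern for one well-formed UTF-8 character. The ranges assume RFC 3629
# (no overlong forms, no surrogates, nothing above U+10FFFF).
UTF8_CHAR = (
    rb"(?:[\x00-\x7F]"                      # 1 byte: ASCII
    rb"|[\xC2-\xDF][\x80-\xBF]"             # 2 bytes
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"         # 3 bytes, overlong forms excluded
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"  # 3 bytes
    rb"|\xED[\x80-\x9F][\x80-\xBF]"         # 3 bytes, surrogates excluded
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"      # 4 bytes, overlong forms excluded
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"          # 4 bytes
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2})"     # 4 bytes, up to U+10FFFF
)


def find_between(data: bytes, left: bytes, right: bytes):
    """Return the one valid UTF-8 character found between left and right, or None."""
    pattern = re.compile(
        re.escape(left) + b"(" + UTF8_CHAR + b")" + re.escape(right)
    )
    m = pattern.search(data)
    return m.group(1) if m else None


# Binary data with stray bytes around the UTF-8 text "a\u00e9b".
data = b"\xff\x00a\xc3\xa9b\x80"
print(find_between(data, b"a", b"b"))                         # b'\xc3\xa9'
# "." matches exactly one byte, valid or not.
print(re.search(rb"a.b", b"a\xffb", re.DOTALL) is not None)   # True
# The character pattern rejects the invalid byte.
print(find_between(b"a\xffb", b"a", b"b"))                    # None
```

In this sketch the matcher never decodes or validates the input as a whole, so stray bytes elsewhere in the data do not affect the search; only the spot where a character is required is checked against the pattern.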


