GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales


Package: grep

Reported by: Vincent Lefevre <vincent <at> vinc17.net>

Date: Fri, 12 Sep 2014 01:26:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.



Message #22 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Fri, 12 Sep 2014 17:59:41 -0700

Vincent Lefevre wrote:

> This is still better than no optimization at all.

We'd have to see; not every optimization is worth the trouble.

> if the behavior is chosen by an option, the user would be aware
> of the meaning of the output, so that this won't really matter.

It'd be better if there wasn't a new grep option simply to avoid a 
libpcre performance bug.

> Could you give some reference?

The pcreunicode man page mentions some of this issue under "Validity of 
UTF-8 string".  My impression is that the actual history of behavior 
changes is more complicated than what that simple summary would suggest.

> This doesn't introduce undefined behavior, just a different
> behavior

Again, it'd be better if grep Just Worked.

> I suppose that this is due
> to the many retries from the pcresearch.c code on binary files (the
> line is split into many sublines, many often consisting of a single
> byte), i.e. the problem is on the grep side.

libpcre is not giving 'grep' an efficient way to search data that can 
contain encoding errors.  This does not mean "the problem is on the grep 
side".
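To make the cost under discussion concrete, here is a rough Python sketch of what prescreening for encoding errors amounts to. It is not grep's pcresearch.c code; the function name valid_utf8_chunks is hypothetical, and Python's own UTF-8 decoder stands in for libpcre's validity check. The sketch cuts a buffer into runs of valid UTF-8, each of which would then need its own search call. On binary-like input, many of those runs are a single byte long.

```python
def valid_utf8_chunks(data: bytes):
    """Yield (offset, chunk) for each run of valid UTF-8 found in data.

    Illustrative only: a real matcher would then be invoked once per chunk,
    which is where the repeated retries come from.
    """
    pos = 0
    while pos < len(data):
        rest = data[pos:]
        try:
            rest.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start and err.end are offsets into `rest`, not `data`.
            if err.start > 0:
                yield pos, rest[:err.start]
            pos += err.end  # skip the offending byte(s) and retry
        else:
            yield pos, rest
            return


# Text with a few stray bytes yields a few chunks...
print(list(valid_utf8_chunks(b"ab\xffcd\xfe\xfde")))
# ...while binary-like data degenerates into one-byte chunks.
print(list(valid_utf8_chunks(b"a\x80b\x80c")))
```

Each chunk boundary corresponds to one more call into the matcher, so the number of calls grows with the number of encoding errors rather than with the number of lines.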

> I don't see how this
> could be solved except by doing the UTF-8 check on the grep side.

There's another way: fix libpcre so that it works on arbitrary binary 
data, without the need for prescreening the data.  That's the 
fundamental problem here.

>>> I often want to take binary files into account
>>
>> In those cases I suggest using a unibyte C locale.
>
> I still want "." to match a single (valid) UTF-8 character.

How about this idea instead?  Use a unibyte C locale, and write a 
unibyte regular expression C that matches a single valid UTF-8 character 
(using whatever definition you like for UTF-8).  Then, you can use . to 
match single bytes and C to match characters.  This gives you all the 
power you need, without the slowdown due to UTF-8 processing, a slowdown 
that will be inevitable no matter how we change grep or libpcre.
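
As an illustration of the idea above, here is a minimal sketch in Python, whose re module on bytes patterns behaves like a unibyte matcher. It is not taken from grep or libpcre. The definition of "valid UTF-8 character" used for C is an assumption: the well-formed byte sequences of RFC 3629 (no overlong forms, no surrogates, nothing above U+10FFFF). The names C, byte_between and char_between are made up for the example.

```python
import re

# Unibyte pattern for one well-formed UTF-8 character (RFC 3629 table).
# Assumption: this is the chosen definition of "valid"; adjust to taste.
C = (
    rb"(?:[\x00-\x7F]"                      # 1 byte: ASCII
    rb"|[\xC2-\xDF][\x80-\xBF]"             # 2 bytes, no overlong C0/C1
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"         # 3 bytes, no overlong
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"  # 3 bytes, general case
    rb"|\xED[\x80-\x9F][\x80-\xBF]"         # 3 bytes, excluding surrogates
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"      # 4 bytes, no overlong
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"          # 4 bytes, general case
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2})"     # 4 bytes, up to U+10FFFF
)

# "." matches exactly one byte; C matches exactly one character.
byte_between = re.compile(rb"a.b", re.DOTALL)
char_between = re.compile(rb"a" + C + rb"b")

sample_char = "a\u00e9b".encode("utf-8")  # 'a', e-acute (2 bytes), 'b'
sample_junk = b"a\xffb"                   # 'a', invalid byte, 'b'

print(bool(char_between.search(sample_char)), bool(byte_between.search(sample_char)))
print(bool(char_between.search(sample_junk)), bool(byte_between.search(sample_junk)))
```

The same alternation could presumably be handed to grep in a unibyte C locale, provided the regex flavor in use accepts byte escapes or the raw bytes are embedded in the pattern; that part is untested here.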




This bug report was last modified 3 years and 181 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.