GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales


Package: grep

Reported by: Vincent Lefevre <vincent <at> vinc17.net>

Date: Fri, 12 Sep 2014 01:26:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.



Message #86 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Zoltán Herczeg <hzmester <at> freemail.hu>
Cc: bug-grep <at> gnu.org, 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Fri, 26 Sep 2014 12:20:45 -0700
On 09/26/2014 11:04 AM, Zoltán Herczeg wrote:
> this is a very interesting discussion.

Yes, I have a lot of other things I'm *supposed* to be doing, but this 
thread is more fun....

>>> /(?<=\x9c)#/
>>>
>>> Does it match \xd5\x9c# starting from #?
>> No, because the input does not contain a \x9c encoding error.  Encoding errors
>> match only themselves, not parts of other characters.  That is how the glibc
>> matchers behave, and it's what users expect.
> Why is \xc9 part of another character? It depends on how you interpret \xd5.

Sorry, I assume you meant \x9c here?  Anyway, the point is that 
conceptually you walk through the input byte sequence left-to-right, 
converting it to characters as you go, and if you encounter an encoding 
error in the process you convert the error to the corresponding 
"character" outside the Unicode range.  You then do all matching against 
the converted sequence.  So there is no question about interpretation: 
it's the left-to-right interpretation.  This simple and easy-to-explain 
approach is used by grep's other matchers, by Emacs, etc.
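In Python terms, this conceptual model is essentially what the `surrogateescape` error handler does: each byte that is not valid UTF-8 at its position becomes a lone surrogate in U+DC80..U+DCFF, i.e. a "character" outside real Unicode text. A minimal sketch of the \xd5\x9c example above (Python purely for illustration; grep and libpcre are C):

```python
import re

# Decode left-to-right; any byte that is not valid UTF-8 at its
# position becomes a lone surrogate U+DC80..U+DCFF -- a "character"
# outside the range of real Unicode text.
valid = b"\xd5\x9c#".decode("utf-8", errors="surrogateescape")
broken = b"\x9c#".decode("utf-8", errors="surrogateescape")

# \xd5\x9c is a well-formed two-byte sequence, so \x9c is consumed as
# part of U+055C and no encoding error is ever seen.
assert valid == "\u055c#"

# A lone \x9c is an encoding error, so it maps to U+DC9C.
assert broken == "\udc9c#"

# Matching runs over the converted sequence, so a lookbehind for the
# error "character" matches only the genuinely broken input.
assert re.search("(?<=\udc9c)#", broken)
assert not re.search("(?<=\udc9c)#", valid)
```

So under this model the answer to the earlier question falls out automatically: the lookbehind matches after a *lone* \x9c, never after the \x9c inside \xd5\x9c.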

Obviously you don't want to *implement* it the way I described; instead, 
you want to convert on-the-fly, lazily.  But whatever optimizations you 
do, you do them consistently with the conceptual model.
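As a sketch of that lazy conversion (Python for brevity; `lazy_decode` is a hypothetical helper name, and a real matcher would do this in C without materializing a converted string): decode just enough bytes to produce the next character, and fall back to an out-of-range "character" when the bytes at the current position are not valid UTF-8.

```python
def lazy_decode(data: bytes):
    """Yield characters one at a time, decoding UTF-8 on the fly.
    Invalid bytes become lone surrogates U+DC80..U+DCFF, matching
    what decoding the whole buffer with errors="surrogateescape"
    would produce -- but without converting up front."""
    i = 0
    while i < len(data):
        # UTF-8 sequences are 1-4 bytes; try the shortest prefix
        # that decodes cleanly starting at the current position.
        for n in (1, 2, 3, 4):
            try:
                yield data[i:i + n].decode("utf-8")
                i += n
                break
            except UnicodeDecodeError:
                continue
        else:
            # No valid sequence starts here: this byte is an encoding
            # error, mapped to a "character" outside Unicode text.
            yield chr(0xDC00 + data[i])
            i += 1
```

A lazy implementation like this only pays the conversion cost for the bytes the matcher actually looks at, which is the same principle behind skipping whole-string validation below.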

> The problem is, you do it some way, and others need something else.

In practice, the simple approach explained above works well enough to 
satisfy the vast majority of users.  It's conceivable that some special 
cases in the PCRE world would have trouble fitting into this model, but 
to be honest I don't expect that to be a problem: there shouldn't be any 
serious conceptual issues here, though admittedly there will be some 
nontrivial programming effort.
> I have doubts that slowing down PCRE would increase grep performance.

Again, the proposed change should not slow down libpcre.  It should 
speed it up.  That's the point.  In the PCRE_NO_UTF8_CHECK case, libpcre 
could use exactly the same code it has now, so performance would be 
unaffected.  And in the non-PCRE_NO_UTF8_CHECK case, libpcre should 
typically be faster than it is now, because it would avoid unnecessary 
UTF-8 validation for the parts of the input string that it does not examine.


> This is exactly the use case where filtering is needed. His input is a 
> combination of binary and UTF data, and he needs matches only in the 
> UTF part. Regards, Zoltan 

Filtering would not be needed if libpcre were like grep's other matchers 
and simply worked with arbitrary binary data.
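For comparison, matching at the byte level sidesteps validation entirely; in Python terms (illustrative only, using the stdlib `re` module on `bytes`):

```python
import re

# A line mixing well-formed UTF-8 with a stray non-UTF-8 byte (\x9c).
# Byte-level matching searches it directly -- no UTF-8 validation pass,
# no filtering step to separate "binary" from "text" input.
data = b"caf\xc3\xa9 \x9c end"
assert re.search(rb"end", data)
```

This is the sense in which grep's other matchers "simply work" on arbitrary binary data.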




This bug report was last modified 3 years and 181 days ago.


