GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales

Package: grep

Reported by: Vincent Lefevre <vincent <at> vinc17.net>

Date: Fri, 12 Sep 2014 01:26:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Zoltán Herczeg <hzmester <at> freemail.hu>
Cc: 18454 <at> debbugs.gnu.org
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Sat, 27 Sep 2014 13:54:24 -0700

Zoltán Herczeg wrote:
> He said 'I still want "." to match a single (valid) UTF-8 character.'

That's what the GNU matchers do, yes.  '.' does not match an invalid byte.  It's 
a reasonable default.  If you have some users who want '.' to match an invalid 
byte, you can add a flag for them, just as there's a PCRE_DOTALL flag for users 
who want '.' to match newline.  That being said, I doubt whether users will care 
enough to need such a flag.  (After all, they're evidently not caring *now*, as 
libpcre can't search such data at *all*.)

> In the regex world, matching performance is the key aspect of an engine

Absolutely.  That's why we're having this discussion: libpcre is slow when 
matching binary data.

> A "simple" change like this would require a major redesign of the engine.

It'd be nontrivial, yes.  But it's clearly doable.  (Not that I'm volunteering....)

> What should happen, if the starting offset is inside an otherwise valid UTF character?

The same thing that would happen if an input file started with the tail end of a 
UTF-8 sequence.  The leading bytes are invalid.  'grep' deals with this already; 
it's not a problem.
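
To illustrate the point about a search that starts inside a character, here is a minimal C sketch. The helper name `leading_stray_bytes` is hypothetical and is not taken from grep or libpcre; it only assumes that a UTF-8 continuation byte (bit pattern 10xxxxxx) can never begin a character, so each such byte at the start of a buffer can be treated as one invalid byte.

```c
#include <assert.h>
#include <stddef.h>

/* True if B is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper, not grep's code: count the bytes at the start
   of BUF[0..LEN) that cannot begin a character because they are
   continuation bytes.  A matcher could treat each of them as a single
   invalid byte and resume normal decoding after them.  */
size_t leading_stray_bytes(const unsigned char *buf, size_t len)
{
    size_t n = 0;
    assert(buf != NULL || len == 0);
    while (n < len && is_continuation(buf[n]))
        n++;
    return n;
}
```

For a buffer that begins with the last byte of "é" (0xC3 0xA9) followed by "ab", the helper reports one stray byte, and decoding can then proceed from 'a' as usual.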

>> Filtering would not be needed if libpcre were like grep's other matchers
>> and simply worked with arbitrary binary data.
>
> This might be efficient for engines which scan the input only in the forward
> direction and read every character once.

It can also be efficient for matchers, like grep's, that don't necessarily do 
that.  It just takes more implementation work, that's all.  It's not rocket 
science to go backwards through a UTF-8 string and to catch decoding errors as 
you go.
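
A rough C sketch of stepping backwards through UTF-8 while catching decoding errors follows. The function names `utf8_decode` and `utf8_prev` are hypothetical, and this is not how grep or libpcre is actually implemented; the sketch assumes that any byte which does not end a well-formed sequence is reported as a single invalid byte, and that overlong forms, surrogates, and values above U+10FFFF count as invalid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one sequence starting at P with N bytes available.
   Return its length (1..4) and store the code point in *CP,
   or return 0 if the bytes do not form a valid sequence.  */
size_t utf8_decode(const unsigned char *p, size_t n, uint32_t *cp)
{
    /* Smallest code point allowed for each length (rejects overlongs). */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t c;

    if (n == 0)
        return 0;
    if (p[0] < 0x80) { *cp = p[0]; return 1; }
    else if ((p[0] & 0xE0) == 0xC0) { len = 2; c = p[0] & 0x1F; }
    else if ((p[0] & 0xF0) == 0xE0) { len = 3; c = p[0] & 0x0F; }
    else if ((p[0] & 0xF8) == 0xF0) { len = 4; c = p[0] & 0x07; }
    else return 0;              /* stray continuation byte or 0xF8..0xFF */

    if (n < len)
        return 0;               /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;           /* missing continuation byte */
        c = (c << 6) | (p[i] & 0x3F);
    }
    if (c < min_cp[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;               /* overlong, out of range, or surrogate */
    *cp = c;
    return len;
}

/* Step backwards from offset POS (POS > 0) in BUF.  Return the length
   of the valid character that ends exactly at POS, or 0 if there is
   none, in which case the caller treats BUF[POS - 1] as one invalid
   byte and steps back by 1.  */
size_t utf8_prev(const unsigned char *buf, size_t pos, uint32_t *cp)
{
    size_t back;
    assert(pos > 0);
    for (back = 1; back <= 4 && back <= pos; back++) {
        unsigned char b = buf[pos - back];
        if ((b & 0xC0) != 0x80) {
            /* First non-continuation byte: it must start a sequence
               of exactly BACK bytes for the character to end at POS.  */
            return utf8_decode(buf + pos - back, back, cp) == back ? back : 0;
        }
    }
    return 0;                   /* only continuation bytes in reach */
}
```

A caller would loop with `pos -= (n = utf8_prev(buf, pos, &cp)) ? n : 1;`, so each step looks at no more than four bytes, and invalid bytes are classified the same way as in a forward scan that uses `utf8_decode`.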




This bug report was last modified 3 years and 181 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.