GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales
Message #16 received at 18454 <at> debbugs.gnu.org:
Vincent Lefevre wrote:
> I think that (1) is rather simple
You may think it simple for the REs you're interested in, but someone
else might say "hey! that doesn't cover the REs *I'm* interested in!".
Solving the problem in general is nontrivial.
> But this is already the case:
I was assuming the case where the input data contains an encoding error
(not a null byte) that is transformed to a null byte before the user
sees it.
Really, this null-byte-replacement business would be just too weird. I
don't see it as a viable general-purpose solution.
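(For concreteness, the transformation being discussed is roughly the
sketch below; the function name is mine, not anything in grep, and it
assumes the locale has already been set from the environment:)

#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Sketch of the idea under discussion: overwrite each byte that is an
   encoding error with '\0'.  NUL is valid UTF-8 and preserves byte
   offsets, so the buffer becomes safe to hand to the matcher -- but
   the user may then see NULs that were never in the input, which is
   the weirdness objected to above.  */
static void
null_out_encoding_errors (char *buf, size_t n)
{
  mbstate_t st;
  size_t i = 0;
  memset (&st, 0, sizeof st);
  while (i < n)
    {
      wchar_t wc;
      size_t len = mbrtowc (&wc, buf + i, n - i, &st);
      if (len == (size_t) -1 || len == (size_t) -2)
        {
          buf[i++] = '\0';            /* encoding error or truncated
                                         sequence: blank out one byte */
          memset (&st, 0, sizeof st); /* state is undefined after an error */
        }
      else
        i += len ? len : 1;           /* len == 0 means a null character,
                                         which is 1 byte in UTF-8 */
    }
}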
> Parsing UTF-8 is standard.
It's a standard that keeps evolving; different releases of libpcre have
done it differently, and I expect things to continue to evolve. It's
not something I would want to maintain separately from libpcre itself.
Have you investigated why libpcre is so *slow* when doing UTF-8
checking? Why would libpcre be 10x slower than grep's checking by
hand?!? I don't get it. Surely there's a simple fix on the libpcre side.
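For comparison, a by-hand check is a single linear pass over the
buffer, along these lines (a from-scratch sketch, not grep's actual
code):

#include <stdbool.h>
#include <stddef.h>

/* Return true if the n bytes at p are well-formed UTF-8: no stray or
   missing continuation bytes, no overlong forms, no surrogates, and
   nothing above U+10FFFF.  */
static bool
buffer_is_valid_utf8 (unsigned char const *p, size_t n)
{
  unsigned char const *lim = p + n;
  while (p < lim)
    {
      unsigned char c = *p++;
      int len;
      unsigned int cp, min;
      if (c < 0x80)
        continue;                     /* ASCII fast path */
      else if ((c & 0xE0) == 0xC0)
        len = 1, cp = c & 0x1F, min = 0x80;
      else if ((c & 0xF0) == 0xE0)
        len = 2, cp = c & 0x0F, min = 0x800;
      else if ((c & 0xF8) == 0xF0)
        len = 3, cp = c & 0x07, min = 0x10000;
      else
        return false;                 /* stray continuation byte, or 0xF8+ */
      if (lim - p < len)
        return false;                 /* sequence runs past the buffer */
      while (len--)
        {
          if ((*p & 0xC0) != 0x80)
            return false;             /* not a continuation byte */
          cp = (cp << 6) | (*p++ & 0x3F);
        }
      if (cp < min || 0x10FFFF < cp || (0xD800 <= cp && cp <= 0xDFFF))
        return false;                 /* overlong, out of range, or
                                         UTF-16 surrogate */
    }
  return true;
}

Nothing in a pass like that looks like it should be 10x cheaper than
what the library could do internally.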
> I often want to take binary files into account
In those cases I suggest using a unibyte C locale. This should solve
the performance problem. Really, unibyte is the way to go here; it's
gonna be faster for large binary scanning no matter what is done about
this UTF-8 business.
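To see why: in a unibyte locale MB_CUR_MAX is 1, so every byte is a
complete character and there is simply nothing to decode or validate.
A trivial demo (hypothetical, not from grep):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

/* Print the maximum bytes per character in the current locale.
   Under LC_ALL=C this is 1; under a UTF-8 locale it is larger,
   and every input buffer has to be decoded accordingly.  */
int
main (void)
{
  setlocale (LC_ALL, "");  /* take the locale from the environment */
  printf ("MB_CUR_MAX = %d\n", (int) MB_CUR_MAX);
  return 0;
}

In practice that just means running something like
'LC_ALL=C grep -aP PATTERN FILE' (-a to treat binary data as text).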