GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales
Message #16 received at 18454 <at> debbugs.gnu.org:
Vincent Lefevre wrote:
> I think that (1) is rather simple
You may think it simple for the REs you're interested in, but someone
else might say "hey! that doesn't cover the REs *I'm* interested in!".
Solving the problem in general is nontrivial.
> But this is already the case:
I was assuming the case where the input data contains an encoding error
(not a null byte) that is transformed to a null byte before the user
sees it.
Really, this null-byte-replacement business would be just too weird. I
don't see it as a viable general-purpose solution.
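(For concreteness, the transformation being discussed is roughly the
sketch below; the function name is mine, not anything in grep, and it
assumes the locale has already been set from the environment:)

#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Sketch of the idea under discussion: overwrite each byte that is an
   encoding error with '\0'.  NUL is valid UTF-8 and preserves byte
   offsets, so the buffer becomes safe to hand to the matcher -- but
   the user may then see NULs that were never in the input, which is
   the weirdness objected to above.  */
static void
null_out_encoding_errors (char *buf, size_t n)
{
  mbstate_t st;
  size_t i = 0;
  memset (&st, 0, sizeof st);
  while (i < n)
    {
      wchar_t wc;
      size_t len = mbrtowc (&wc, buf + i, n - i, &st);
      if (len == (size_t) -1 || len == (size_t) -2)
        {
          buf[i++] = '\0';            /* encoding error or truncated
                                         sequence: blank out one byte */
          memset (&st, 0, sizeof st); /* state is undefined after an error */
        }
      else
        i += len ? len : 1;           /* len == 0 means a null character,
                                         which is 1 byte in UTF-8 */
    }
}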
> Parsing UTF-8 is standard.
It's a standard that keeps evolving; different releases of libpcre have
done it differently, and I expect things to continue to evolve. It's
not something I would want to maintain separately from libpcre itself.
Have you investigated why libpcre is so *slow* when doing UTF-8
checking? Why would libpcre be 10x slower than grep's checking by
hand?!? I don't get it. Surely there's a simple fix on the libpcre side.
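For comparison, a by-hand check is a single linear pass over the
buffer, along these lines (a from-scratch sketch, not grep's actual
code):

#include <stdbool.h>
#include <stddef.h>

/* Return true if the n bytes at p are well-formed UTF-8: no stray or
   missing continuation bytes, no overlong forms, no surrogates, and
   nothing above U+10FFFF.  */
static bool
buffer_is_valid_utf8 (unsigned char const *p, size_t n)
{
  unsigned char const *lim = p + n;
  while (p < lim)
    {
      unsigned char c = *p++;
      int len;
      unsigned int cp, min;
      if (c < 0x80)
        continue;                     /* ASCII fast path */
      else if ((c & 0xE0) == 0xC0)
        len = 1, cp = c & 0x1F, min = 0x80;
      else if ((c & 0xF0) == 0xE0)
        len = 2, cp = c & 0x0F, min = 0x800;
      else if ((c & 0xF8) == 0xF0)
        len = 3, cp = c & 0x07, min = 0x10000;
      else
        return false;                 /* stray continuation byte, or 0xF8+ */
      if (lim - p < len)
        return false;                 /* sequence runs past the buffer */
      while (len--)
        {
          if ((*p & 0xC0) != 0x80)
            return false;             /* not a continuation byte */
          cp = (cp << 6) | (*p++ & 0x3F);
        }
      if (cp < min || 0x10FFFF < cp || (0xD800 <= cp && cp <= 0xDFFF))
        return false;                 /* overlong, out of range, or
                                         UTF-16 surrogate */
    }
  return true;
}

Nothing in a pass like that looks like it should be 10x cheaper than
what the library could do internally.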
> I often want to take binary files into account
In those cases I suggest using a unibyte C locale. This should solve
the performance problem. Really, unibyte is the way to go here; it's
gonna be faster for large binary scanning no matter what is done about
this UTF-8 business.
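To see why: in a unibyte locale MB_CUR_MAX is 1, so every byte is a
complete character and there is simply nothing to decode or validate.
A trivial demo (hypothetical, not from grep):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

/* Print the maximum bytes per character in the current locale.
   Under LC_ALL=C this is 1; under a UTF-8 locale it is larger,
   and every input buffer has to be decoded accordingly.  */
int
main (void)
{
  setlocale (LC_ALL, "");  /* take the locale from the environment */
  printf ("MB_CUR_MAX = %d\n", (int) MB_CUR_MAX);
  return 0;
}

In practice that just means running something like
'LC_ALL=C grep -aP PATTERN FILE' (-a to treat binary data as text).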