GNU bug report logs -
#18454
Improve performance when -P (PCRE) is used in UTF-8 locales
Hi,
>It's purely a performance question. GNU grep already uses libpcre to search
>binary data, and it works now. It's just slow, that's all. I'm willing to live
>with this, and tell users "Sorry, but libpcre is not designed to search binary
>data quickly; if you want speed then don't use grep's -P option." If you're
>willing to live with this too, we're done.
Yes, PCRE is not designed for matching binary data as UTF. Too much complexity for too little gain. Normal search can be used on binary data without limitations.
>Grep already does that sort of thing. And it's smart enough to start matching
>only at character boundaries. It's not libpcre's job to worry about this; the
>caller can worry about it.
Thank you for bringing this up. I don't see any point in reimplementing what is already there. However, if PCRE says it supports UTF matching in binary data, then it should, and "what is already there" depends on the environment. This is clearly the best argument for why the environment should be responsible for handling the binary part of the data. Most environments need some kind of validation anyway, and we would just be duplicating code. It is good to hear that everything is already in grep; perhaps a few more lines are needed to do it in a thread.
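To illustrate the kind of caller-side boundary handling discussed above, here is a minimal sketch in C of a bounded backward scan. It is not code from grep or PCRE. The helper names `is_continuation` and `utf8_char_start` are hypothetical. The scan looks back at most 3 bytes, since a UTF-8 character has at most 3 continuation bytes, so the cost per call is constant even on binary data.

```c
#include <assert.h>
#include <stddef.h>

/* True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: return the offset of the first byte of the
   character that contains buf[pos], looking back at most 3 bytes.
   If no non-continuation byte is found inside that window, the data
   is not valid UTF-8 here and pos is returned unchanged, so the
   caller can treat the byte as a stand-alone unit.  The helper does
   not check that the lead byte announces enough continuation bytes
   to reach pos; that is left to a separate validity check. */
static size_t utf8_char_start(const unsigned char *buf, size_t len, size_t pos)
{
    size_t back = 0;

    assert(pos < len);

    /* Step back over continuation bytes, but never more than 3. */
    while (back < 3 && back < pos && is_continuation(buf[pos - back]))
        back++;

    return is_continuation(buf[pos - back]) ? pos : pos - back;
}
```

A caller could use such a helper to round a candidate match offset down to a character boundary before handing the buffer to the matcher.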
>The code you posted could be made faster than that; among other things there
>should not be an unbounded backward scan. And even the code you posted would
>often be faster than what's in libpcre now. That early UTF-8 validity prepass
>is a killer.
I would recommend disabling it. Its only purpose is to return early for invalid buffers. I am sure grep already knows whether a buffer is invalid, since it scans the buffer.
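As a rough sketch of what "the caller already knows" could look like, the following C function finds the longest well-formed UTF-8 prefix of a buffer in a single pass. The name `utf8_valid_prefix` is hypothetical and this is not grep's actual scanner. The assumption is that a caller which has validated a span this way can then ask the library to skip its own check (in PCRE this is what the `PCRE_NO_UTF8_CHECK` option is for) and handle the invalid bytes itself.

```c
#include <stddef.h>

/* Hypothetical helper: return the length of the longest prefix of buf
   that is well-formed UTF-8.  The result equals len when the whole
   buffer is valid.  Overlong forms, surrogates, and code points above
   U+10FFFF are rejected. */
static size_t utf8_valid_prefix(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t need;        /* continuation bytes expected */
        unsigned long min;  /* smallest code point for this length */
        unsigned long cp;   /* decoded code point */

        if (c < 0x80) {                 /* ASCII */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; min = 0x80;    cp = c & 0x1F;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; min = 0x800;   cp = c & 0x0F;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; min = 0x10000; cp = c & 0x07;
        } else {
            return i;   /* stray continuation or invalid lead byte */
        }

        if (len - i - 1 < need)
            return i;   /* character truncated by end of buffer */

        for (size_t k = 1; k <= need; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return i;
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }

        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return i;

        i += need + 1;
    }
    return len;
}
```

Calling it repeatedly, skipping one byte after each invalid position, would split a binary buffer into valid stretches that can be matched without any further validity prepass.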
Regards,
Zoltan
This bug report was last modified 3 years and 181 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.