#18454 - Improve performance when -P (PCRE) is used in UTF-8 locales

Package: grep;

Reported by: Vincent Lefevre <vincent <at> vinc17.net>

Date: Fri, 12 Sep 2014 01:26:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

From: Zoltán Herczeg <hzmester <at> freemail.hu>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 18454 <at> debbugs.gnu.org
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Tue, 30 Sep 2014 20:10:58 +0200 (CEST)

Hi,

>It's purely a performance question. GNU grep already uses libpcre to search
>binary data, and it works now. It's just slow, that all. I'm willing to live
>with this, and tell users "Sorry, but libpcre is not designed to search binary
>data quickly; if you want speed then don't use grep's -P option." If you're
>willing to live with this too, we're done.

Yes, PCRE is not designed for matching binary data as UTF. Too much complexity for too little gain. Normal (non-UTF) search can be used on binary data without limitations.

>Grep already does that sort of thing. And it's smart enough to start matching
>only at character boundaries. It's not libpcre's job to worry about this; the
>caller can worry about it.

Thank you for bringing this up. I don't see any point in reimplementing what is already there. However, if PCRE says it supports UTF matching in binary data, it should. Because the "what is there" depends on the environment. This is clearly the best answer to why the environment is responsible for handling the binary part of the data. Most environments need some kind of validation, and we would just duplicate code. It is good to hear that everything is in grep; perhaps a few more lines are needed to do it in a thread.

>The code you posted could be made faster than that; among other things there
>should not be an unbounded backward scan. And even the code you posted would
>often be faster than what's in libpcre now. That early UTF-8 validity prepass
>is a killer.

I would recommend disabling it. Its only purpose is returning early for invalid buffers. I am sure grep already knows when a buffer is invalid, since it scans the buffer.

Regards,
Zoltan
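The two ideas discussed above, a bounded backward scan to a character boundary and a caller-side validity check, can be sketched in C roughly as follows. This is only an illustration, not code from grep or libpcre: the function names `utf8_char_start` and `utf8_valid_prefix` are hypothetical, and the sketch assumes the caller passes a position inside the buffer.

```c
#include <stddef.h>

/* Return nonzero if b is a UTF-8 continuation byte (10xxxxxx). */
int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: move pos back to the first byte of the character
   that contains it.  A UTF-8 character has at most 3 continuation bytes,
   so the scan stops after 3 steps.  If no non-continuation byte is found
   within that bound, pos is returned unchanged.  The lead byte itself is
   not checked here; that is left to the validator below. */
size_t utf8_char_start(const unsigned char *buf, size_t pos)
{
    size_t p = pos;
    int steps = 0;
    while (p > 0 && steps < 3 && is_continuation(buf[p])) {
        p--;
        steps++;
    }
    return is_continuation(buf[p]) ? pos : p;
}

/* Hypothetical helper: return the length of the longest prefix of buf
   that is valid UTF-8.  Rejects stray continuation bytes, truncated
   sequences, overlong forms, surrogates and values above U+10FFFF. */
size_t utf8_valid_prefix(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t need;            /* continuation bytes expected */
        unsigned long cp, min;  /* decoded value, smallest legal value */

        if (b < 0x80) { i++; continue; }
        else if ((b & 0xE0) == 0xC0) { need = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; min = 0x10000; }
        else break;             /* stray continuation or invalid lead byte */

        if (len - i <= need)
            break;              /* sequence truncated by end of buffer */
        for (size_t k = 1; k <= need; k++) {
            if (!is_continuation(buf[i + k]))
                return i;
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            break;
        i += need + 1;
    }
    return i;
}
```

A caller that has validated a span this way could, in principle, search just that span and ask libpcre to skip its own prepass by passing PCRE_NO_UTF8_CHECK to pcre_exec; whether that pays off in grep is exactly the performance question under discussion here.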

This bug report was last modified 3 years and 231 days ago.

