GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales


Package: grep;

Reported by: Vincent Lefevre <vincent <at> vinc17.net>

Date: Fri, 12 Sep 2014 01:26:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.



Message #59 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Zoltán Herczeg <hzmester <at> freemail.hu>
To: bug-grep <at> gnu.org
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Sun, 21 Sep 2014 08:46:39 +0200 (CEST)
Hi,

I am the developer of the JIT compiler in PCRE. I frequently follow discussions about PCRE, and I found this comment here on bug-grep <at> gnu.org:

> There's another way: fix libpcre so that it works on arbitrary binary data, without the need for prescreening
> the data. That's the fundamental problem here. 

This would require too much effort for no benefit, for the following reasons:

- What should you do when you encounter an invalid UTF-8 sequence: ignore it? Decode it to some arbitrary value? For example, what should happen if you find a stray 0xe9? Does it match \xe9? Everybody has a different opinion about handling invalid UTF sequences, and this would lead to never-ending arguments on pcre-dev.

- The bigger problem is performance. Handling invalid UTF sequences requires a lot of extra checks and kills many optimizations. For example, when we encounter a 0xc5 lead byte, we know that the input buffer has at least one more byte, so we do not check the buffer size. We also assume that the highest two bits of the second byte are 10, and we do not check this when decoding that character. Such checks would also kill other optimizations, like the Boyer-Moore-style search in the JIT. The major problem is that everybody would suffer this performance regression, including those who pass valid UTF strings.
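The ambiguity in the first point above can be made concrete with a small sketch (mine, not from this thread; it uses Python's codec error handlers rather than PCRE): each policy for a stray 0xe9 byte produces a different string, and therefore different match results, which is exactly why no single choice would satisfy everyone.

```python
import re

data = b"caf\xe9 latte"  # 0xe9 is a lone UTF-8 lead byte: invalid on its own

# Policy 1: strict decoding rejects the input outright.
try:
    data.decode("utf-8")
    strict = "accepted"
except UnicodeDecodeError:
    strict = "rejected"

# Policy 2: "replace" substitutes U+FFFD; policy 3: "ignore" drops the byte.
replaced = data.decode("utf-8", errors="replace")   # 'caf\ufffd latte'
ignored = data.decode("utf-8", errors="ignore")     # 'caf latte'

# The chosen policy changes whether a pattern matches at all.
print(strict)                                            # rejected
print(re.search(r"caf latte", replaced) is not None)     # False
print(re.search(r"caf latte", ignored) is not None)      # True
```

Baking any one of these policies into the matcher itself would impose that choice, and its cost, on every caller.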

For these reasons, such a change will never happen.

But there are alternatives.

* The best solution is multi-threaded grepping: one thread reads the file data and replaces or removes invalid UTF-8 sequences, turning them into something valid; the other thread runs PCRE on the filtered data. Alternatively, you can convert everything to UTF-32 and use pcre32.

* The other solution is improving PCRE's survivability: if the buffer passed to PCRE is preceded by at least one zero character code and followed by maximum UTF character length - 1 zeroes (6 in UTF-8, 1 in UTF-16), we could guarantee that PCRE does not crash and does not enter an infinite loop. Nothing else is guaranteed; i.e., if you search for /ab/ and the invalid UTF sequence contains "ab", it might not be found (or might be found with the interpreter but not with the JIT, or vice versa). If you use pcre32, there is no need for any extra padding.
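The first alternative, filtering the data before the regex engine ever sees it, can be sketched as follows (my own illustration, not grep's or PCRE's code; a real implementation would run the filter and the match in separate threads, here they are sequential functions for clarity):

```python
import re


def sanitize(raw: bytes) -> str:
    # Replace every invalid UTF-8 sequence with U+FFFD, producing text
    # the regex engine can process with no validity checks of its own.
    return raw.decode("utf-8", errors="replace")


def grep_line(pattern: str, raw_line: bytes) -> bool:
    # Stage 2: match against the already-sanitized data.
    return re.search(pattern, sanitize(raw_line)) is not None


print(grep_line(r"ab", b"xx ab yy"))      # valid input matches as usual
print(grep_line(r"ab", b"\xc5 ab \xe9"))  # invalid bytes no longer reach the matcher
```

The cost of validation is paid once, in the reader stage, instead of on every step of the match loop.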

Regards,
Zoltan








GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.