GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales

Package: grep

Reported by: Vincent Lefevre <vincent <at> vinc17.net>

Date: Fri, 12 Sep 2014 01:26:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Message #53 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Santiago Ruano Rincón <santiago <at> debian.org>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Thu, 18 Sep 2014 12:36:57 -0700

On Thu, Sep 18, 2014 at 1:33 AM, Santiago Ruano Rincón
<santiago <at> debian.org> wrote:
> On 17/09/14 at 23:00, Paul Eggert wrote:
>> I've installed all the patches mentioned so far.
>>
>
> I've successfully built the latest commit
> (f6de00f6cec3831b8f334de7dbd1b59115627457), but I don't see any
> performance boost. Rather the opposite.
>
> Comparing with Debian's grep 2.20-3, which includes your first patch to solve
> this -P issue, 0001-grep-P-invalid-utf8-non-matching.patch:
>
> grep -P asdf /usr/bin/*  12,42s user 0,12s system 99% cpu 12,545 total
> src/grep -P asdf /usr/bin/*  14,37s user 0,12s system 99% cpu 14,492 total
>
> Note that basic grep also slows down:
>
> grep asdf /usr/bin/*  0,22s user 0,16s system 99% cpu 0,382 total
> src/grep asdf /usr/bin/*  1,26s user 0,12s system 99% cpu 1,384 total

Thank you for running timing comparisons.

Once I verified that I had no large, sparse files in my grep working directory,
I ran the same test there (du -sh . reports 176M; du --app -sh . reports 139M).

The following shows a performance regression when searching files
like those in my grep working directory.
The new grep (v2.20-46-gf6de00f) takes 2.5x longer than 2.20.14.
This is with a hot cache (best of several runs) on an
Intel(R) Xeon(R) CPU E5-2660, compiled with gcc-5.x.

$ diff -u <(env time grep -r asdf . 2>&1) <(PATH=src:$PATH env time grep -r asdf . 2>&1)
--- /proc/self/fd/11    2014-09-18 12:07:43.169721947 -0700
+++ /proc/self/fd/12    2014-09-18 12:07:43.169721947 -0700
@@ -1,3 +1,3 @@
 ./src/grep.c:               printf 'asdfqwerzxcv\rASDF\tZXCV\n'
-0.08user 0.10system 0:00.18elapsed 100%CPU (0avgtext+0avgdata 6256maxresident)k
-0inputs+0outputs (0major+670minor)pagefaults 0swaps
+0.40user 0.11system 0:00.51elapsed 99%CPU (0avgtext+0avgdata 5328maxresident)k
+0inputs+0outputs (0major+634minor)pagefaults 0swaps

It looks like most of the difference is the result of commit
cd36abd46c5e0768606979ea75a51732062f5624, "grep: treat a file as binary
if its prefix contains encoding errors", with its new, locale-sensitive
"is_binary" test. I saw the above timing difference even with LC_ALL=C,
so one quick fix would be to skip the use of mbrlen when possible.
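
To make the "skip mbrlen when possible" idea concrete, here is a minimal
sketch. It is not grep's actual is_binary code: the helper name
buf_looks_binary is hypothetical, and it assumes that in a unibyte locale
(MB_CUR_MAX == 1, e.g. LC_ALL=C) only NUL bytes need to mark a buffer as
binary, so the per-character mbrlen scan can be bypassed entirely there.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not taken from grep's sources: return true if
   BUF (LEN bytes) looks binary, meaning it contains a NUL byte or,
   in a multibyte locale, an encoding error.  */
static bool
buf_looks_binary (char const *buf, size_t len)
{
  assert (buf != NULL || len == 0);

  /* A NUL byte marks the buffer as binary in any locale.  */
  if (len != 0 && memchr (buf, '\0', len))
    return true;

  /* Fast path (assumption of this sketch): in a unibyte locale every
     non-NUL byte is accepted as a character, so mbrlen is skipped.  */
  if (MB_CUR_MAX == 1)
    return false;

  /* Slow path: scan character by character for encoding errors.  */
  mbstate_t st;
  memset (&st, 0, sizeof st);
  size_t i = 0;
  while (i < len)
    {
      size_t n = mbrlen (buf + i, len - i, &st);
      if (n == (size_t) -1)
        return true;            /* invalid byte sequence */
      if (n == (size_t) -2)
        break;                  /* incomplete trailing character: not
                                   counted as an error in this sketch */
      i += n ? n : 1;           /* n == 0 cannot occur after the NUL
                                   check, but guard against it anyway */
    }
  return false;
}
```

With this shape, a run under LC_ALL=C never reaches the mbrlen loop, and
the cost in a UTF-8 locale is unchanged. Whether treating all non-NUL
bytes as text in unibyte locales is acceptable is a policy question this
sketch does not settle.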




This bug report was last modified 3 years and 181 days ago.