GNU bug report logs - #16232
[PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales

Package: grep

Reported by: Jim Meyering <jim <at> meyering.net>

Date: Mon, 23 Dec 2013 22:40:02 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.


Message #68 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 16232 <16232 <at> debbugs.gnu.org>, Padraig Brady <P <at> draigbrady.com>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster
 (sometimes 10x) in multibyte locales
Date: Wed, 19 Feb 2014 19:44:59 -0800
Hmm... it's not as clear-cut as I first thought.
(I built 2.17 plus the above patch and installed it in a directory named grep-2.18.)

The following times grep 2.16, 2.17 and 2.17+patch in two ways:

$ yes jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj | head -10000000 > k
$ for i in 16 17 18; do echo $i; env LC_ALL=en_US.UTF-8 time /p/p/grep-2.$i/bin/grep -i foobar k; done
16
       15.96 real        14.57 user         0.12 sys
17
        1.13 real         1.07 user         0.06 sys
18
        1.96 real         1.89 user         0.06 sys

In the en_US.UTF-8 locale, the above search takes more than 70% longer
with the proposed patch than with 2.17 (1.96s vs. 1.13s).
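
As a quick check of that figure, the relative slowdown can be computed from
the "real" times reported above. This is a minimal sketch; pct_slower is a
hypothetical helper name, not something from grep or from this thread.

```shell
# pct_slower BASE NEW: print how much longer NEW took than BASE, in percent.
# Hypothetical helper, for illustration only.
pct_slower() {
  awk -v base="$1" -v new="$2" 'BEGIN { printf "%.0f\n", (new / base - 1) * 100 }'
}

# "real" seconds for 2.17 and for 2.17+patch in en_US.UTF-8
pct_slower 1.13 1.96   # prints 73
```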

Contrast that with performance in the non-UTF8 ja_JP.eucJP locale:

$ yes $(printf '%078dm' 0)|head -10000 > in
$ for i in 16 17 18; do echo $i; env LC_ALL=ja_JP.eucJP time /p/p/grep-2.$i/bin/grep -i n in; done
16
        0.03 real         0.02 user         0.00 sys
17
        2.98 real         2.96 user         0.00 sys
18
        0.02 real         0.02 user         0.00 sys

Using the jjj+foobar example in that same ja_JP.eucJP locale, but with only
100k lines, we see there was a roughly 200x performance regression going
from grep-2.16 to 2.17:

$ yes jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj | head -100000 > k
$ for i in 16 17 18; do echo $i; env LC_ALL=ja_JP.eucJP time /p/p/grep-2.$i/bin/grep -i foobar k; done
16
        0.15 real         0.14 user         0.00 sys
17
       27.74 real        27.72 user         0.01 sys
18
        0.11 real         0.11 user         0.00 sys
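
The "200x" figure is a round number; the exact factor depends on which
column is compared. A small sketch of the arithmetic, with slowdown as a
hypothetical helper name:

```shell
# slowdown OLD CUR: print how many times longer CUR took than OLD.
# Hypothetical helper, for illustration only.
slowdown() {
  awk -v old="$1" -v cur="$2" 'BEGIN { printf "%.0fx\n", cur / old }'
}

# grep-2.16 vs. grep-2.17 in ja_JP.eucJP, timings from above
slowdown 0.15 27.74   # real: prints 185x
slowdown 0.14 27.72   # user: prints 198x
```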

Obviously, I want to retain all of 2.17's performance gain in UTF-8 locales,
while avoiding the 200x penalty in multibyte non-UTF-8 locales like ja_JP.eucJP.
So I have prepared a better patch.
With the two attached commits (on top of 2.17), I get these timings, i.e.,
the same 200x improvement with ja_JP.eucJP and no regression with
en_US.UTF-8:

$ for i in 16 17 18; do printf "$i: "; env LC_ALL=ja_JP.eucJP time /p/p/grep-2.$i/bin/grep -i foobar k; done
16:         0.14 real         0.14 user         0.00 sys
17:        27.97 real        27.95 user         0.01 sys
18:         0.12 real         0.12 user         0.00 sys

$ for i in 16 17 18; do printf "$i: "; env LC_ALL=en_US.UTF-8 time /p/p/grep-2.$i/bin/grep -i foobar k; done
16:         0.13 real         0.12 user         0.00 sys
17:         0.01 real         0.01 user         0.00 sys
18:         0.01 real         0.01 user         0.00 sys
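
The timing loops above could be wrapped in a small reusable function. This
is only a sketch under stated assumptions: bench_grep and the GREP_BINS
variable are hypothetical names, the /p/p/grep-2.* install paths are
specific to the machine used above, and date +%s (whole-second resolution)
stands in for time(1), so it only separates large gaps such as 0.1s versus
27s.

```shell
# bench_grep LOCALE PATTERN FILE
# Run a case-insensitive search with each binary listed in GREP_BINS
# (hypothetical variable; defaults to the grep found on PATH) and print
# the elapsed wall-clock time in whole seconds.
bench_grep() (
  locale=$1 pattern=$2 file=$3
  for g in ${GREP_BINS:-grep}; do
    start=$(date +%s)
    # Exit status 1 only means "no match", which these tests expect.
    env LC_ALL="$locale" "$g" -i "$pattern" "$file" >/dev/null \
      && status=0 || status=$?
    end=$(date +%s)
    if [ "$status" -gt 1 ]; then
      echo "$g: failed with status $status" >&2
      continue
    fi
    echo "$g: $((end - start))s"
  done
)

# Example, mirroring the commands above (paths are the ones from this message):
# GREP_BINS="/p/p/grep-2.16/bin/grep /p/p/grep-2.17/bin/grep" \
#   bench_grep ja_JP.eucJP foobar k
```

Running the body in a subshell keeps the helper's variables out of the
calling shell.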
[k.txt (text/plain, attachment)]
