GNU bug report logs - #16232
[PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales
Reported by: Jim Meyering <jim <at> meyering.net>
Date: Mon, 23 Dec 2013 22:40:02 UTC
Severity: normal
Tags: patch
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
Message #68 received at 16232 <at> debbugs.gnu.org (full text, mbox):
Hmm... it's not as clear-cut as I first thought.
(I built 2.17 plus the above patch and installed it in a directory named
grep-2.18, so "18" in the loops below means 2.17+patch.)
The following times 2.16, 2.17 and 2.17+patch two ways:
$ yes jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj | head -10000000 > k
$ for i in 16 17 18; do echo $i; env LC_ALL=en_US.UTF-8 time /p/p/grep-2.$i/bin/grep -i foobar k; done
16
15.96 real 14.57 user 0.12 sys
17
1.13 real 1.07 user 0.06 sys
18
1.96 real 1.89 user 0.06 sys
The above search takes more than 70% longer with the proposed patch.
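As a quick check of that percentage, here is a minimal sketch that recomputes it from the "real" times shown above. The helper name `slowdown_pct` is hypothetical (it is not part of grep or its test suite), and the two arguments are simply the 2.17 and 2.17+patch timings copied from this run.

```shell
# Hypothetical helper: percentage by which a new timing exceeds a baseline.
# $1 = baseline seconds, $2 = new seconds.
slowdown_pct() {
  awk -v base="$1" -v new="$2" 'BEGIN { printf "%.0f\n", (new - base) / base * 100 }'
}

# 2.17 took 1.13s real; 2.17+patch took 1.96s real.
slowdown_pct 1.13 1.96   # prints 73
```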
Contrast that with performance in the non-UTF8 ja_JP.eucJP locale:
$ yes $(printf '%078dm' 0)|head -10000 > in
$ for i in 16 17 18; do echo $i; env LC_ALL=ja_JP.eucJP time /p/p/grep-2.$i/bin/grep -i n in; done
16
0.03 real 0.02 user 0.00 sys
17
2.98 real 2.96 user 0.00 sys
18
0.02 real 0.02 user 0.00 sys
Using the jjj+foobar example, but with only 100k lines, we see there
was a roughly 200x performance regression going from grep-2.16 to 2.17:
$ yes jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj | head -100000 > k
$ for i in 16 17 18; do echo $i; env LC_ALL=ja_JP.eucJP time /p/p/grep-2.$i/bin/grep -i foobar k; done
16
0.15 real 0.14 user 0.00 sys
17
27.74 real 27.72 user 0.01 sys
18
0.11 real 0.11 user 0.00 sys
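The size of that regression can be recomputed from the "real" times just shown. This is only a sketch: the helper name `slowdown_factor` is hypothetical, and the inputs are the 2.16 and 2.17 timings from this single run, so the result is approximate.

```shell
# Hypothetical helper: how many times slower a new timing is than a baseline.
# $1 = baseline seconds, $2 = new seconds.
slowdown_factor() {
  awk -v base="$1" -v new="$2" 'BEGIN { printf "%.0f\n", new / base }'
}

# 2.16 took 0.15s real; 2.17 took 27.74s real in ja_JP.eucJP.
slowdown_factor 0.15 27.74   # prints 185
```

The measured factor is about 185, which is rounded to 200x in the discussion.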
Obviously, I want to retain all of 2.17's performance gain in UTF-8 locales,
while avoiding the 200x penalty in multi-byte non-UTF8 locales like ja_JP.eucJP.
So I have prepared a better patch.
With the two attached commits (on top of 2.17), I get these timings,
i.e., the same 200x improvement with ja_JP.eucJP and no regression
with en_US.UTF-8:
$ for i in 16 17 18; do printf "$i: "; env LC_ALL=ja_JP.eucJP time /p/p/grep-2.$i/bin/grep -i foobar k; done
16: 0.14 real 0.14 user 0.00 sys
17: 27.97 real 27.95 user 0.01 sys
18: 0.12 real 0.12 user 0.00 sys
$ for i in 16 17 18; do printf "$i: "; env LC_ALL=en_US.UTF-8 time /p/p/grep-2.$i/bin/grep -i foobar k; done
16: 0.13 real 0.12 user 0.00 sys
17: 0.01 real 0.01 user 0.00 sys
18: 0.01 real 0.01 user 0.00 sys
[k.txt (text/plain, attachment)]
This bug report was last modified 11 years and 82 days ago.