GNU bug report logs - #16893
[PATCH] Avoid matching line-by-line for case-insensitive with grep

Previous Next

Package: grep;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Thu, 27 Feb 2014 16:03:02 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Full log


View this message in rfc822 format

From: Jim Meyering <jim <at> meyering.net>
To: Glenn Morris <rgm <at> gnu.org>
Cc: 16893 <at> debbugs.gnu.org, Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Subject: bug#16893: [PATCH] Avoid matching line-by-line for case-insensitive with grep
Date: Thu, 27 Feb 2014 18:54:14 -0800
On Thu, Feb 27, 2014 at 8:54 AM, Glenn Morris <rgm <at> gnu.org> wrote:
>
> This message was sent to the submit <at> debbugs address with no
> "Package:" specified in the body. So it ended up on the help-debbugs
> mailing list rather than bug-grep. I have assigned it to grep and am
> sending this mail, which will appear on the bug-grep list.
>
> For new reports, either use the bug-grep address, or remember to use
> Package: grep at the start of the body. They both have identical results.
>
> Norihiro Tanaka wrote:
>
>> Now grep and awk matchers doesn't waste buffer in case-sensisitive matching.
>> So I think that we can avoid line-by-line matching for them.
>>
>> It enable to speed up case-sensitive matching with grep or awk matcher
>> without trivial_case_ignore as fast as when with it.
>>
>> In bug#16232:
>>> The following times 2.16, 2.17 and 2.17+patch two ways:
>>>
>>> $ yes jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj | head -10000000 > k
>>> $ for i in 16 17 18; do echo $i; env LC_ALL=en_US.UTF-8 time
>>> /p/p/grep-2.$i/bin/grep -i foobar k; done
>>> 16
>>>        15.96 real        14.57 user         0.12 sys
>>> 17
>>>         1.13 real         1.07 user         0.06 sys
>>> 18
>>>         1.96 real         1.89 user         0.06 sys
>>>
>>> The above search takes more than 70% longer with the proposed patch.
>>
>> Therefore, I think 30% slow-down is caused by the line-by-line matching
>> for them.
>
> [See attachment at http://debbugs.gnu.org/16893]

Thank you for forwarding that, Glenn.

Thank you for the patch, Norihiro.
However, your removal of the "MB_CUR_MAX == 1" disjunct
would cause unibyte grep -i with (-F or -P) to match line-by-line,
whereas currently it uses the buffer-matching code.
That would make a search like this take 3 times longer:

  seq 30000000 > in
  LC_ALL=C grep -iF foo in

Without your patch, best of 5 wall clock time is 0.31s,
yet with the patch, it takes 1.00s.

I presume you will want to retain that disjunct?
If you submit an adjusted patch, please also include in the
commit log some timing examples showing how the change affects
grep's performance.




This bug report was last modified 11 years and 108 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.