GNU bug report logs - #16232
[PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales

Package: grep;

Reported by: Jim Meyering <jim <at> meyering.net>

Date: Mon, 23 Dec 2013 22:40:02 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 16232 in the body.
You can then email your comments to 16232 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox

Report forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Mon, 23 Dec 2013 22:40:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Jim Meyering <jim <at> meyering.net>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Mon, 23 Dec 2013 22:40:03 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: bug-grep <at> gnu.org
Subject: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in
 multibyte locales
Date: Mon, 23 Dec 2013 14:39:14 -0800

[Message part 1 (text/plain, inline)]

FYI, here is a quick and clean/safe performance improvement for grep -i.
I expect to push this commit right after the upcoming bug-fix release.
Currently, this optimization is enabled when the search string is
ASCII and contains neither of '\' (backslash) nor '['.  I expect to
eliminate the latter two constraints in a follow-on commit including
tests to exercise all of the corner cases.

Happy holidays,
Jim

[k.txt (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Mon, 23 Dec 2013 22:53:02 GMT) Full text and rfc822 format available.

Message #8 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Jim Meyering <jim <at> meyering.net>, 16232 <at> debbugs.gnu.org
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes
 10x) in multibyte locales
Date: Mon, 23 Dec 2013 15:52:26 -0700

[Message part 1 (text/plain, inline)]

On 12/23/2013 03:39 PM, Jim Meyering wrote:
> FYI, here is a quick and clean/safe performance improvement for grep -i.
> I expect to push this commit right after the upcoming bug-fix release.
> Currently, this optimization is enabled when the search string is
> ASCII and contains neither of '\' (backslash) nor '['.  I expect to
> eliminate the latter two constraints in a follow-on commit including
> tests to exercise all of the corner cases.
> 

> +
> +  /* Worst case is that every byte of keys will be alpha,
> +     so every byte B will map to the sequence of 4 bytes [Bb].  */

Umm, is this always true?  Consider the UTF-8 Turkish locale, where
single-byte i has a multi-byte uppercase (and conversely the single-byte
I has a multi-byte lowercase) - that is, 'i' and 'I' are not case pairs.

> +      else
> +        {
> +          *p++ = '[';
> +          *p++ = *keys;
> +          *p++ = islower (*keys) ? toupper (*keys) : tolower (*keys);

This performs the ASCII-only toupper/tolower, rather than the proper
locale-sensitive case conversion, which probably makes this patch
misbehave for LC_ALL=tr_TR.UTF-8.

> 
> +  /* As currently implemented, case-insensitive matching is expensive in
> +     multi-byte locales because of a few outlier locales in which some
> +     characters change size when converted to upper or lower case.  To
> +     accommodate those, we revert to searching the input one line at a
> +     time, rather than using the much more efficient buffer search.
> +     However, if we have a plain ascii search string, /foo/, we can
> +     convert it to an equivalent case-insensitive /[fF][oO][oO]/, and thus
> +     avoid the expensive read-and-process-a-line-at-a-time requirement.
> +     Optimize-away the "-i" option, when possible, converting each
> +     candidate alpha, C, in the regexp to [Cc].  */

In other words, this comment describes the very flaw of the outlier
tr_TR locale that you are now violating with your optimization.

Without your patch, this is correct behavior:

$ echo i | LC_ALL=tr_TR.UTF-8 grep -i I
$ echo i | LC_ALL=tr_TR.UTF-8 grep -i i
i

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Mon, 23 Dec 2013 23:13:01 GMT) Full text and rfc822 format available.

Message #11 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Eric Blake <eblake <at> redhat.com>
Cc: 16232 <at> debbugs.gnu.org
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster
 (sometimes 10x) in multibyte locales
Date: Mon, 23 Dec 2013 15:12:26 -0800

On Mon, Dec 23, 2013 at 2:52 PM, Eric Blake <eblake <at> redhat.com> wrote:
> On 12/23/2013 03:39 PM, Jim Meyering wrote:
>> FYI, here is a quick and clean/safe performance improvement for grep -i.
>> I expect to push this commit right after the upcoming bug-fix release.
>> Currently, this optimization is enabled when the search string is
>> ASCII and contains neither of '\' (backslash) nor '['.  I expect to
>> eliminate the latter two constraints in a follow-on commit including
>> tests to exercise all of the corner cases.
>>
>
>> +
>> +  /* Worst case is that every byte of keys will be alpha,
>> +     so every byte B will map to the sequence of 4 bytes [Bb].  */
>
> Umm, is this always true?  Consider the UTF-8 Turkish locale, where

Hi Eric,

Thanks for the review.
Did you miss the "isascii" check in the new trivial_case_convert function?
If you can describe circumstances in which the new patch malfunctions,
please do,
but everything you wrote seems to rely on a false assumption.
E.g., your turkish-I example works fine with my patch.

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Mon, 23 Dec 2013 23:31:02 GMT) Full text and rfc822 format available.

Message #14 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Jim Meyering <jim <at> meyering.net>
Cc: 16232 <at> debbugs.gnu.org
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes
 10x) in multibyte locales
Date: Mon, 23 Dec 2013 16:30:19 -0700

[Message part 1 (text/plain, inline)]

On 12/23/2013 04:12 PM, Jim Meyering wrote:
> Did you miss the "isascii" check in the new trivial_case_convert function?

No.  But even with that check in place:

> If you can describe circumstances in which the new patch malfunctions,
> please do,
> but everything you wrote seems to rely on a false assumption.

No, it's a quite real complaint - your patch is broken for tr_TR.

> E.g., your turkish-I example works fine with my patch.

isascii('i') is true, but converting 'i' to '[iI]' is incorrect in the
tr_TR locale.  Rather, the conversion must be to '[iİ]'; similarly, 'I'
would be translated to '[Iı]'.  Neither of those conversions fit in 4
bytes (since dotted-capital-I and dotless-lower-i are both multi-byte
characters).

Need help easily finding those characters on a non-Turkish keyboard?  I
used:
$ echo iI | LC_ALL=tr_TR.UTF-8 sed 's/\(.\)\(.\)/\U\1\L\2/'

At any rate, prior to your patch, lower dotless i in the buffer gives an
insensitive match to upper dotless I in the pattern:

$ echo ı | LC_ALL=tr_TR.UTF-8 grep -i I || echo no match
ı

After your patch:

$ echo ı | LC_ALL=tr_TR.UTF-8 src/grep -i I || echo no match
no match

Oops, you failed to match lower dotless i insensitively against upper
dotless I, because upper dotless I is ascii, but you incorrectly
converted it into the wrong pattern.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Mon, 23 Dec 2013 23:55:03 GMT) Full text and rfc822 format available.

Message #17 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Eric Blake <eblake <at> redhat.com>
Cc: 16232 <16232 <at> debbugs.gnu.org>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster
 (sometimes 10x) in multibyte locales
Date: Mon, 23 Dec 2013 15:54:27 -0800

On Mon, Dec 23, 2013 at 3:30 PM, Eric Blake <eblake <at> redhat.com> wrote:
> On 12/23/2013 04:12 PM, Jim Meyering wrote:
>> Did you miss the "isascii" check in the new trivial_case_convert function?
>
> No.  But even with that check in place:
>
>> If you can describe circumstances in which the new patch malfunctions,
>> please do,
>> but everything you wrote seems to rely on a false assumption.
>
> No, it's a quite real complaint - your patch is broken for tr_TR.
>
>> E.g., your turkish-I example works fine with my patch.
>
> isascii('i') is true, but converting 'i' to '[iI]' is incorrect in the
> tr_TR locale.  Rather, the conversion must be to '[iİ]'; similarly, 'I'
> would be translated to '[Iı]'.  Neither of those conversions fit in 4
> bytes (since dotted-capital-I and dotless-lower-i are both multi-byte
> characters).
>
> Need help easily finding those characters on a non-Turkish keyboard?  I
> used:
> $ echo iI | LC_ALL=tr_TR.UTF-8 sed 's/\(.\)\(.\)/\U\1\L\2/'
>
> At any rate, prior to your patch, lower dotless i in the buffer gives an
> insensitive match to upper dotless I in the pattern:
>
> $ echo ı | LC_ALL=tr_TR.UTF-8 grep -i I || echo no match
> ı
>
> After your patch:
>
> $ echo ı | LC_ALL=tr_TR.UTF-8 src/grep -i I || echo no match
> no match
>
> Oops, you failed to match lower dotless i insensitively against upper
> dotless I, because upper dotless I is ascii, but you incorrectly
> converted it into the wrong pattern.

Thanks for dotting those 'i's.  While there is no risk of buffer
overrun, there would definitely be a problem with the tr_TR locale.
I will resolve it by removing the isascii check and performing
multibyte case conversion to form each [cC] pair.  Of course,
that will mean removing the "4 * byte-length-of-search-string"
buffer size limitation.

I will also add tests based on your examples.

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Wed, 08 Jan 2014 03:57:02 GMT) Full text and rfc822 format available.

Message #20 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Eric Blake <eblake <at> redhat.com>
Cc: 16232 <16232 <at> debbugs.gnu.org>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster
 (sometimes 10x) in multibyte locales
Date: Tue, 7 Jan 2014 19:56:21 -0800

[Message part 1 (text/plain, inline)]

On Mon, Dec 23, 2013 at 3:54 PM, Jim Meyering <jim <at> meyering.net> wrote:
...
> Thanks for dotting those 'i's.  While there is no risk of buffer
> overrun, there would definitely be a problem with the tr_TR locale.
> I will resolve it by removing the isascii check and performing
> multibyte case conversion to form each [cC] pair.  Of course,
> that will mean removing the "4 * byte-length-of-search-string"
> buffer size limitation.
>
> I will also add tests based on your examples.

Here is the improved patch.

[k.txt (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Fri, 10 Jan 2014 05:20:01 GMT) Full text and rfc822 format available.

Message #23 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Eric Blake <eblake <at> redhat.com>
Cc: 16232 <16232 <at> debbugs.gnu.org>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster
 (sometimes 10x) in multibyte locales
Date: Thu, 9 Jan 2014 21:19:21 -0800

[Message part 1 (text/plain, inline)]

On Tue, Jan 7, 2014 at 7:56 PM, Jim Meyering <jim <at> meyering.net> wrote:
> Here is the improved patch.

I've added a NEWS entry.  Pushing tomorrow:

[k.txt (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Sat, 11 Jan 2014 01:50:01 GMT) Full text and rfc822 format available.

Message #26 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: 16232 <at> debbugs.gnu.org, Jim Meyering <jim <at> meyering.net>
Subject: bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes
 10x) in multibyte locales
Date: Sat, 11 Jan 2014 01:49:30 +0000

Cool so it does this transformation:

  sed 's/./[\L&\U&]/g'

Though multi byte case handling has all sorts of edge cases (pardon the pun),
and it may not be always valid to treat each character independently?
For example see some of the tests in:
http://git.sv.gnu.org/gitweb/?p=gnulib.git;a=blob;f=tests/unicase/test-ulc-casecmp.c;hb=HEAD

I wonder might this faster path be restricted to a safer but very common input subset of:

(MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80))

Also are the following printfs in the test redundant?

> +data=$(      printf "I:$I $i:i")
> +search_str=$(printf "$i:i I:$I")

nice improvement!
Pádraig.

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Sat, 11 Jan 2014 04:53:01 GMT) Full text and rfc822 format available.

Message #29 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Pádraig Brady <P <at> draigbrady.com>
Cc: 16232 <16232 <at> debbugs.gnu.org>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster
 (sometimes 10x) in multibyte locales
Date: Fri, 10 Jan 2014 20:52:00 -0800

[Message part 1 (text/plain, inline)]

On Fri, Jan 10, 2014 at 5:49 PM, Pádraig Brady <P <at> draigbrady.com> wrote:
> Cool so it does this transformation:
>
>   sed 's/./[\L&\U&]/g'
>
> Though multi byte case handling has all sorts of edge cases (pardon the pun),
> and it may not be always valid to treat each character independently?
> For example see some of the tests in:
> http://git.sv.gnu.org/gitweb/?p=gnulib.git;a=blob;f=tests/unicase/test-ulc-casecmp.c;hb=HEAD

It seems you're right.  Since it's a many-to-one mapping in some
cases, simply using one lower case character and one upper case
version won't cover all possibilities.

> I wonder might this faster path be restricted to a safer but very common input subset of:
>
> (MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80))

That sounds like a good approach.
Now I need another test case, to demonstrate that the current code can
cause trouble.

> Also are the following printfs in the test redundant?
>
>> +data=$(      printf "I:$I $i:i")
>> +search_str=$(printf "$i:i I:$I")

Good catch.  Those were vestiges of pre-factoring code, where they
were needed.  Here's the patch to fix that part, in your name:

[k.txt (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Sat, 11 Jan 2014 05:41:01 GMT) Full text and rfc822 format available.

Message #32 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Pádraig Brady <P <at> draigbrady.com>
Cc: 16232 <16232 <at> debbugs.gnu.org>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster
 (sometimes 10x) in multibyte locales
Date: Fri, 10 Jan 2014 21:40:24 -0800

On Fri, Jan 10, 2014 at 8:52 PM, Jim Meyering <jim <at> meyering.net> wrote:
>> I wonder might this faster path be restricted to a safer but very common input subset of:
>>
>> (MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80))
>
> That sounds like a good approach.
> Now I need another test case, to demonstrate that the current code can
> cause trouble.

Hmm... after thinking about this for a while and actually trying to
break the current code (did not find a way to demonstrate a regression),
I have concluded that the current approach is no worse than the prior
one of matching a case-mapped regexp vs. each case-mapped input line.

That's not to say that it's perfect, of course.
The "LATIN SMALL LETTER J WITH CARON, COMBINING DOT BELOW" example
from gnulib's test-ulc-casecmp.c is a great example: this matches:

    printf '\x6A\xCC\x8C\xCC\xA3\n'|src/grep -i "$(printf
'\x6A\xCC\x8C\xCC\xA3')"

but this does not, yet probably should:

    printf '\xC7\xB0\xCC\xA3\n'|src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"

Can you see a way to demonstrate a regression?

Thanks again,

Jim

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Sat, 11 Jan 2014 11:34:02 GMT) Full text and rfc822 format available.

Message #35 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Jim Meyering <jim <at> meyering.net>
Cc: 16232 <16232 <at> debbugs.gnu.org>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes
 10x) in multibyte locales
Date: Sat, 11 Jan 2014 11:33:47 +0000

On 01/11/2014 05:40 AM, Jim Meyering wrote:
> On Fri, Jan 10, 2014 at 8:52 PM, Jim Meyering <jim <at> meyering.net> wrote:
>>> I wonder might this faster path be restricted to a safer but very common input subset of:
>>>
>>> (MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80))
>>
>> That sounds like a good approach.
>> Now I need another test case, to demonstrate that the current code can
>> cause trouble.
> 
> Hmm... after thinking about this for a while and actually trying to
> break the current code (did not find a way to demonstrate a regression),
> I have concluded that the current approach is no worse than the prior
> one of matching a case-mapped regexp vs. each case-mapped input line.
> 
> That's not to say that it's perfect, of course.
> The "LATIN SMALL LETTER J WITH CARON, COMBINING DOT BELOW" example
> from gnulib's test-ulc-casecmp.c is a great example: this matches:
> 
>     printf '\x6A\xCC\x8C\xCC\xA3\n'|src/grep -i "$(printf
> '\x6A\xCC\x8C\xCC\xA3')"
> 
> but this does not, yet probably should:
> 
>     printf '\xC7\xB0\xCC\xA3\n'|src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"
> 
> Can you see a way to demonstrate a regression?

Oh right, it doesn't handle these cases already.
Fair enough I don't see a regression then.

+1

Pádraig.

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Sat, 11 Jan 2014 14:17:01 GMT) Full text and rfc822 format available.

Message #38 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Jim Meyering <jim <at> meyering.net>
Cc: 16232 <16232 <at> debbugs.gnu.org>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes
 10x) in multibyte locales
Date: Sat, 11 Jan 2014 14:15:58 +0000

On 01/11/2014 11:33 AM, Pádraig Brady wrote:
> On 01/11/2014 05:40 AM, Jim Meyering wrote:
>> On Fri, Jan 10, 2014 at 8:52 PM, Jim Meyering <jim <at> meyering.net> wrote:
>>>> I wonder might this faster path be restricted to a safer but very common input subset of:
>>>>
>>>> (MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80))
>>>
>>> That sounds like a good approach.
>>> Now I need another test case, to demonstrate that the current code can
>>> cause trouble.
>>
>> Hmm... after thinking about this for a while and actually trying to
>> break the current code (did not find a way to demonstrate a regression),
>> I have concluded that the current approach is no worse than the prior
>> one of matching a case-mapped regexp vs. each case-mapped input line.
>>
>> That's not to say that it's perfect, of course.
>> The "LATIN SMALL LETTER J WITH CARON, COMBINING DOT BELOW" example
>> from gnulib's test-ulc-casecmp.c is a great example: this matches:
>>
>>     printf '\x6A\xCC\x8C\xCC\xA3\n'|src/grep -i "$(printf
>> '\x6A\xCC\x8C\xCC\xA3')"
>>
>> but this does not, yet probably should:
>>
>>     printf '\xC7\xB0\xCC\xA3\n'|src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"
>>
>> Can you see a way to demonstrate a regression?
> 
> Oh right, it doesn't handle these cases already.
> Fair enough I don't see a regression then.

This is also a good summary of stuff to consider with case:
http://www.unicode.org/faq/casemap_charprop.html

So picking another case situation from there:
  "in the Greek script, capital sigma (U+03A3) is the uppercase form of both
   the regular (U+03C2) and final (U+03C3) lowercase sigma."

One can see that sed handles this:
  $ printf '\u03C2\u03C3\n' | sed 's/.*/&\U&/'
  ςσΣΣ
  $ printf '\u03A3\n' | sed 's/.*/&\L&/'
  Σσ

Though I was surprised the grep (2.14) didn't match any combo of these
  $ printf '\u03C2\u03C3\n' | grep -Fi "$(printf \u03A3)"
  $ printf '\u03A3\n' | grep -Fi "$(printf \u03C2)"
  $ printf '\u03A3\n' | grep -Fi "$(printf \u03C3)"

Not a regression of course.

cheers,
Pádraig.

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Sat, 11 Jan 2014 17:58:01 GMT) Full text and rfc822 format available.

Message #41 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Pádraig Brady <P <at> draigbrady.com>
Cc: 16232 <16232 <at> debbugs.gnu.org>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster
 (sometimes 10x) in multibyte locales
Date: Sat, 11 Jan 2014 09:56:53 -0800

On Sat, Jan 11, 2014 at 6:15 AM, Pádraig Brady <P <at> draigbrady.com> wrote:
> On 01/11/2014 11:33 AM, Pádraig Brady wrote:
>> On 01/11/2014 05:40 AM, Jim Meyering wrote:
>>> On Fri, Jan 10, 2014 at 8:52 PM, Jim Meyering <jim <at> meyering.net> wrote:
>>>>> I wonder might this faster path be restricted to a safer but very common input subset of:
>>>>>
>>>>> (MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80))
>>>>
>>>> That sounds like a good approach.
>>>> Now I need another test case, to demonstrate that the current code can
>>>> cause trouble.
>>>
>>> Hmm... after thinking about this for a while and actually trying to
>>> break the current code (did not find a way to demonstrate a regression),
>>> I have concluded that the current approach is no worse than the prior
>>> one of matching a case-mapped regexp vs. each case-mapped input line.
>>>
>>> That's not to say that it's perfect, of course.
>>> The "LATIN SMALL LETTER J WITH CARON, COMBINING DOT BELOW" example
>>> from gnulib's test-ulc-casecmp.c is a great example: this matches:
>>>
>>>     printf '\x6A\xCC\x8C\xCC\xA3\n'|src/grep -i "$(printf
>>> '\x6A\xCC\x8C\xCC\xA3')"
>>>
>>> but this does not, yet probably should:
>>>
>>>     printf '\xC7\xB0\xCC\xA3\n'|src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"
>>>
>>> Can you see a way to demonstrate a regression?
>>
>> Oh right, it doesn't handle these cases already.
>> Fair enough I don't see a regression then.
>
> This is also a good summary of stuff to consider with case:
> http://www.unicode.org/faq/casemap_charprop.html
>
> So picking another case situation from there:
>   "in the Greek script, capital sigma (U+03A3) is the uppercase form of both
>    the regular (U+03C2) and final (U+03C3) lowercase sigma."
>
> One can see that sed handles this:
>   $ printf '\u03C2\u03C3\n' | sed 's/.*/&\U&/'
>   ςσΣΣ
>   $ printf '\u03A3\n' | sed 's/.*/&\L&/'
>   Σσ
>
> Though I was surprised the grep (2.14) didn't match any combo of these
>   $ printf '\u03C2\u03C3\n' | grep -Fi "$(printf \u03A3)"
>   $ printf '\u03A3\n' | grep -Fi "$(printf \u03C2)"
>   $ printf '\u03A3\n' | grep -Fi "$(printf \u03C3)"
>
> Not a regression of course.

Thank you for the reference and the fine examples.
I'll add the latter as a known-failing test.

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Sun, 12 Jan 2014 04:37:01 GMT) Full text and rfc822 format available.

Message #44 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Pádraig Brady <P <at> draigbrady.com>
Cc: 16232 <16232 <at> debbugs.gnu.org>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster
 (sometimes 10x) in multibyte locales
Date: Sat, 11 Jan 2014 20:36:31 -0800

On Sat, Jan 11, 2014 at 6:15 AM, Pádraig Brady <P <at> draigbrady.com> wrote:
> On 01/11/2014 11:33 AM, Pádraig Brady wrote:
...
> This is also a good summary of stuff to consider with case:
> http://www.unicode.org/faq/casemap_charprop.html
>
> So picking another case situation from there:
>   "in the Greek script, capital sigma (U+03A3) is the uppercase form of both
>    the regular (U+03C2) and final (U+03C3) lowercase sigma."
>
> One can see that sed handles this:
>   $ printf '\u03C2\u03C3\n' | sed 's/.*/&\U&/'
>   ςσΣΣ
>   $ printf '\u03A3\n' | sed 's/.*/&\L&/'
>   Σσ
>
> Though I was surprised the grep (2.14) didn't match any combo of these
>   $ printf '\u03C2\u03C3\n' | grep -Fi "$(printf \u03A3)"
>   $ printf '\u03A3\n' | grep -Fi "$(printf \u03C2)"
>   $ printf '\u03A3\n' | grep -Fi "$(printf \u03C3)"

Actually, if you quote the argument to the latter printf, two of those
do match, both with -F and without:

$ printf '\u03C2\u03C3\n' | grep -Fi "$(printf '\u03A3')"
ςσ
$ printf '\u03A3\n' | grep -Fi "$(printf '\u03C2')"
$ printf '\u03A3\n' | grep -Fi "$(printf '\u03C3')"
Σ

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Sun, 12 Jan 2014 12:57:01 GMT) Full text and rfc822 format available.

Message #47 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Jim Meyering <jim <at> meyering.net>
Cc: 16232 <16232 <at> debbugs.gnu.org>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes
 10x) in multibyte locales
Date: Sun, 12 Jan 2014 12:56:23 +0000

On 01/12/2014 04:36 AM, Jim Meyering wrote:
> On Sat, Jan 11, 2014 at 6:15 AM, Pádraig Brady <P <at> draigbrady.com> wrote:
>> On 01/11/2014 11:33 AM, Pádraig Brady wrote:
> ...
>> This is also a good summary of stuff to consider with case:
>> http://www.unicode.org/faq/casemap_charprop.html
>>
>> So picking another case situation from there:
>>   "in the Greek script, capital sigma (U+03A3) is the uppercase form of both
>>    the regular (U+03C2) and final (U+03C3) lowercase sigma."
>>
>> One can see that sed handles this:
>>   $ printf '\u03C2\u03C3\n' | sed 's/.*/&\U&/'
>>   ςσΣΣ
>>   $ printf '\u03A3\n' | sed 's/.*/&\L&/'
>>   Σσ
>>
>> Though I was surprised the grep (2.14) didn't match any combo of these
>>   $ printf '\u03C2\u03C3\n' | grep -Fi "$(printf \u03A3)"
>>   $ printf '\u03A3\n' | grep -Fi "$(printf \u03C2)"
>>   $ printf '\u03A3\n' | grep -Fi "$(printf \u03C3)"
> 
> Actually, if you quote the argument to the latter printf, two of those
> do match, both with -F and without:
> 
> $ printf '\u03C2\u03C3\n' | grep -Fi "$(printf '\u03A3')"
> ςσ
> $ printf '\u03A3\n' | grep -Fi "$(printf '\u03C2')"
> $ printf '\u03A3\n' | grep -Fi "$(printf '\u03C3')"
> Σ

Oops right.
So that's still no regression with the new scheme
since grep is 1:1 here for Σ and σ.

thanks,
Pádraig.

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Wed, 19 Feb 2014 14:23:01 GMT) Full text and rfc822 format available.

Message #50 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Padraig Brady <P <at> draigBrady.com>
Cc: 16232 <16232 <at> debbugs.gnu.org>, Jim Meyering <jim <at> meyering.net>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster
 (sometimes 10x) in multibyte locales
Date: Wed, 19 Feb 2014 23:22:08 +0900

Hi,

Slow down may be caused by the patch, because MBCSET is processed by not
DFA engine but regexp engine.

I tested performance on grep-2.17 and the version which the patch is reverted.
Latter is 100x faster.

yes $(printf '%078dm' 0)|head -10000 > in

grep-2.17 original:
$ for i in $(seq 10); do env LC_ALL=ja_JP.eucJP time src/grep -i n in; done
Command exited with non-zero status 1
5.92user 1.69system 0:07.73elapsed 98%CPU (0avgtext+0avgdata 3856maxresident)k
0inputs+0outputs (0major+422minor)pagefaults 0swaps
Command exited with non-zero status 1
5.59user 1.87system 0:07.58elapsed 98%CPU (0avgtext+0avgdata 3872maxresident)k
0inputs+0outputs (0major+423minor)pagefaults 0swaps
Command exited with non-zero status 1
6.06user 1.58system 0:07.81elapsed 97%CPU (0avgtext+0avgdata 3872maxresident)k
0inputs+0outputs (0major+423minor)pagefaults 0swaps
Command exited with non-zero status 1
5.73user 1.66system 0:07.52elapsed 98%CPU (0avgtext+0avgdata 3856maxresident)k
0inputs+0outputs (0major+422minor)pagefaults 0swaps
Command exited with non-zero status 1
6.42user 1.19system 0:07.86elapsed 96%CPU (0avgtext+0avgdata 3872maxresident)k
0inputs+0outputs (0major+423minor)pagefaults 0swaps
Command exited with non-zero status 1
6.15user 1.56system 0:08.34elapsed 92%CPU (0avgtext+0avgdata 3888maxresident)k
0inputs+0outputs (0major+424minor)pagefaults 0swaps
Command exited with non-zero status 1
6.97user 0.61system 0:07.77elapsed 97%CPU (0avgtext+0avgdata 3856maxresident)k
0inputs+0outputs (0major+422minor)pagefaults 0swaps
Command exited with non-zero status 1
7.00user 0.57system 0:07.71elapsed 98%CPU (0avgtext+0avgdata 3872maxresident)k
0inputs+0outputs (0major+423minor)pagefaults 0swaps
Command exited with non-zero status 1
7.16user 0.25system 0:07.56elapsed 97%CPU (0avgtext+0avgdata 3872maxresident)k
0inputs+0outputs (0major+423minor)pagefaults 0swaps
Command exited with non-zero status 1
7.04user 0.39system 0:07.60elapsed 97%CPU (0avgtext+0avgdata 3856maxresident)k
0inputs+0outputs (0major+422minor)pagefaults 0swaps

After revert the patch:
$ for i in $(seq 10); do env LC_ALL=ja_JP.eucJP time src/grep -i n in; done
Command exited with non-zero status 1
0.07user 0.02system 0:00.10elapsed 92%CPU (0avgtext+0avgdata 3072maxresident)k
0inputs+0outputs (0major+232minor)pagefaults 0swaps
Command exited with non-zero status 1
0.03user 0.01system 0:00.05elapsed 101%CPU (0avgtext+0avgdata 3072maxresident)k
0inputs+0outputs (0major+218minor)pagefaults 0swaps
Command exited with non-zero status 1
0.04user 0.01system 0:00.06elapsed 90%CPU (0avgtext+0avgdata 3072maxresident)k
0inputs+0outputs (0major+218minor)pagefaults 0swaps
Command exited with non-zero status 1
0.03user 0.02system 0:00.05elapsed 103%CPU (0avgtext+0avgdata 3056maxresident)k
0inputs+0outputs (0major+217minor)pagefaults 0swaps
Command exited with non-zero status 1
0.04user 0.01system 0:00.06elapsed 86%CPU (0avgtext+0avgdata 3088maxresident)k
0inputs+0outputs (0major+219minor)pagefaults 0swaps
Command exited with non-zero status 1
0.04user 0.01system 0:00.06elapsed 91%CPU (0avgtext+0avgdata 3056maxresident)k
0inputs+0outputs (0major+217minor)pagefaults 0swaps
Command exited with non-zero status 1
0.04user 0.01system 0:00.06elapsed 87%CPU (0avgtext+0avgdata 3056maxresident)k
0inputs+0outputs (0major+217minor)pagefaults 0swaps
Command exited with non-zero status 1
0.03user 0.02system 0:00.05elapsed 105%CPU (0avgtext+0avgdata 3072maxresident)k
0inputs+0outputs (0major+218minor)pagefaults 0swaps
Command exited with non-zero status 1
0.04user 0.00system 0:00.06elapsed 90%CPU (0avgtext+0avgdata 3072maxresident)k
0inputs+0outputs (0major+218minor)pagefaults 0swaps
Command exited with non-zero status 1
0.04user 0.01system 0:00.06elapsed 90%CPU (0avgtext+0avgdata 3056maxresident)k
0inputs+0outputs (0major+217minor)pagefaults 0swaps

Norihiro

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Wed, 19 Feb 2014 18:31:02 GMT) Full text and rfc822 format available.

Message #53 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, Padraig Brady <P <at> draigBrady.com>
Cc: 16232 <16232 <at> debbugs.gnu.org>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes
 10x) in multibyte locales
Date: Wed, 19 Feb 2014 10:30:24 -0800

On 02/19/2014 06:22 AM, Norihiro Tanaka wrote:
> I tested performance on grep-2.17 and the version which the patch is reverted.
> Latter is 100x faster.

While we're on the topic, can someone please say why that patch was done 
as a preprocessor for the regex code?  I normally would think that the 
way to speed up regex processing is to improve dfa.c and/or the regex 
library, as that should speed up all programs that use these modules, 
not just grep.

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Wed, 19 Feb 2014 19:18:02 GMT) Full text and rfc822 format available.

Message #56 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 16232 <16232 <at> debbugs.gnu.org>, Padraig Brady <P <at> draigbrady.com>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster
 (sometimes 10x) in multibyte locales
Date: Wed, 19 Feb 2014 11:17:16 -0800

[Message part 1 (text/plain, inline)]

On Wed, Feb 19, 2014 at 6:22 AM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> for i in $(seq 10); do env LC_ALL=ja_JP.eucJP time src/grep -i n in; done

Wow.  You're right.  With the attached patch, I see a speedup of more
than 130x in this case: (fyi, the "time" output is slightly different,
because I have installed GNU time)

grep-2.17$ for i in $(seq 5); do env LC_ALL=ja_JP.eucJP time grep -i n in; done
        2.78 real         2.78 user         0.00 sys
        2.73 real         2.73 user         0.00 sys
        2.75 real         2.75 user         0.00 sys
        2.73 real         2.73 user         0.00 sys
        2.74 real         2.74 user         0.00 sys

2.17+patch$ for i in $(seq 5); do env LC_ALL=ja_JP.eucJP time src/grep
-i n in; done
        0.02 real         0.02 user         0.00 sys
        0.02 real         0.02 user         0.00 sys
        0.02 real         0.02 user         0.00 sys
        0.02 real         0.02 user         0.00 sys
        0.02 real         0.02 user         0.00 sys

I haven't investigated they "why" yet, but expect that I will make
grep-2.18 with just this one performance-improving patch.

Thank you, Norihiro,

Jim

[k.txt (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Wed, 19 Feb 2014 19:52:01 GMT) Full text and rfc822 format available.

Message #59 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 16232 <16232 <at> debbugs.gnu.org>, Padraig Brady <P <at> draigbrady.com>,
 Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster
 (sometimes 10x) in multibyte locales
Date: Wed, 19 Feb 2014 11:51:27 -0800

On Wed, Feb 19, 2014 at 10:30 AM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> While we're on the topic, can someone please say why that patch was done as
> a preprocessor for the regex code?  I normally would think that the way to
> speed up regex processing is to improve dfa.c and/or the regex library, as
> that should speed up all programs that use these modules, not just grep.

Hi Paul,
My patch was a first crack at dealing with an inefficiency that was
grep-specific, so it made sense to apply as a grep-specific preprocessing
phase.  Norihiro found an additional improvement, that, it so happens,
would have had an even bigger impact if my patch had not been applied first.

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Wed, 19 Feb 2014 19:57:01 GMT) Full text and rfc822 format available.

Message #62 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: bug-grep <at> gnu.org
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes
 10x) in multibyte locales
Date: Wed, 19 Feb 2014 11:55:29 -0800

On 02/19/2014 11:17 AM, Jim Meyering wrote:
> I haven't investigated they "why" yet, but expect that I will make
> grep-2.18 with just this one performance-improving patch.

Can you make room for one more patch?  I have a test case and bugfix for 
grep's mishandling of regular expressions like [^^-~] in unibyte locales 
like the C locale on most platforms; this is due to the range-expression 
bug that Arnold reported recently.I can send out the patch later today 
if you like.  The bug has been present since grep 2.8 so I can 
understand if you'd rather wait to fix it until later.

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Wed, 19 Feb 2014 20:02:02 GMT) Full text and rfc822 format available.

Message #65 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 16232 <16232 <at> debbugs.gnu.org>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster
 (sometimes 10x) in multibyte locales
Date: Wed, 19 Feb 2014 12:00:38 -0800

On Wed, Feb 19, 2014 at 11:55 AM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 02/19/2014 11:17 AM, Jim Meyering wrote:
>>
>> I haven't investigated they "why" yet, but expect that I will make
>> grep-2.18 with just this one performance-improving patch.
>
> Can you make room for one more patch?  I have a test case and bugfix for
> grep's mishandling of regular expressions like [^^-~] in unibyte locales
> like the C locale on most platforms; this is due to the range-expression bug
> that Arnold reported recently.I can send out the patch later today if you
> like.  The bug has been present since grep 2.8 so I can understand if you'd
> rather wait to fix it until later.

Hi Paul,
I'll be very happy to consider it.

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Thu, 20 Feb 2014 03:46:01 GMT) Full text and rfc822 format available.

Message #68 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 16232 <16232 <at> debbugs.gnu.org>, Padraig Brady <P <at> draigbrady.com>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster
 (sometimes 10x) in multibyte locales
Date: Wed, 19 Feb 2014 19:44:59 -0800

[Message part 1 (text/plain, inline)]

Hmm... it's not as clear-cut as I first thought.
(I built 2.17+ the above patch and put it in a directory named grep-2.18)

The following times 2.16, 2.17 and 2.17+patch two ways:

$ yes jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj | head -10000000 > k
$ for i in 16 17 18; do echo $i; env LC_ALL=en_US.UTF-8 time
/p/p/grep-2.$i/bin/grep -i foobar k; done
16
       15.96 real        14.57 user         0.12 sys
17
        1.13 real         1.07 user         0.06 sys
18
        1.96 real         1.89 user         0.06 sys

The above search takes more than 70% longer with the proposed patch.

Contrast that with performance in the non-UTF8 ja_JP.eucJP locale:

$ yes $(printf '%078dm' 0)|head -10000 > in
$ for i in 16 17 18; do echo $i; env LC_ALL=ja_JP.eucJP time
/p/p/grep-2.$i/bin/grep -i n in; done
16
        0.03 real         0.02 user         0.00 sys
17
        2.98 real         2.96 user         0.00 sys
18
        0.02 real         0.02 user         0.00 sys

Using the jjj+foobar example, but with only 100k lines, we see there
was a 200x performance regression going from grep-2.16 to 2.17:

$ yes jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj | head -100000 > k
$ for i in 16 17 18; do echo $i; env LC_ALL=ja_JP.eucJP time
/p/p/grep-2.$i/bin/grep -i foobar k; done
16
        0.15 real         0.14 user         0.00 sys
17
       27.74 real        27.72 user         0.01 sys
18
        0.11 real         0.11 user         0.00 sys

Obviously, I want to retain all of 2.17's performance gain in UTF-8 locales,
while avoiding the 200x penalty in multi-byte non-UTF8 locales like ja_JP.eucJP.
So I have prepared a better patch.
With the two attached commits (on top of 2.17), I get these timings,
i.e., the same 200x improvement with ja_JP.eucJP, and no regression
with en_US.UTF8)

$ for i in 16 17 18; do printf "$i: "; env LC_ALL=ja_JP.eucJP time
/p/p/grep-2.$i/bin/grep -i foobar k; done
16:         0.14 real         0.14 user         0.00 sys
17:        27.97 real        27.95 user         0.01 sys
18:         0.12 real         0.12 user         0.00 sys

$ for i in 16 17 18; do printf "$i: "; env LC_ALL=en_US.UTF-8 time
/p/p/grep-2.$i/bin/grep -i foobar k; done
16:         0.13 real         0.12 user         0.00 sys
17:         0.01 real         0.01 user         0.00 sys
18:         0.01 real         0.01 user         0.00 sys

[k.txt (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Thu, 20 Feb 2014 13:40:02 GMT) Full text and rfc822 format available.

Message #71 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro TANAKA <noritnk <at> kcn.ne.jp>
To: Jim Meyering <jim <at> meyering.net>
Cc: 16232 <16232 <at> debbugs.gnu.org>, Padraig Brady <P <at> draigbrady.com>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster
 (sometimes 10x) in multibyte locales
Date: Thu, 20 Feb 2014 22:39:18 +0900

Hi Jim,

Your patch is probably right.

However, I think that the true cause for 100x slow is that DFA engine is
slower than regex engine for case-insensitive matching on a non-UTF-8
locle.

On a multibyte locale, for case-insensitive "a" grep prefers DFA engine,
but for character class "[Aa]" prefers regex engine.

Norihiro

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Thu, 20 Feb 2014 17:14:02 GMT) Full text and rfc822 format available.

Message #74 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 16232 <16232 <at> debbugs.gnu.org>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster
 (sometimes 10x) in multibyte locales
Date: Thu, 20 Feb 2014 09:13:18 -0800

[Message part 1 (text/plain, inline)]

Hi Paul,

In case your bug fix looks safe/small, and is ready, ...

I'm hoping to release 2.18 today, with the attached commits.
Changes since yesterday: comment/log tweaks, and I've hoisted the using_utf8
test in trivial_case_ignore to precede the two memchr tests.

[k.txt (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Thu, 20 Feb 2014 17:25:02 GMT) Full text and rfc822 format available.

Message #77 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: arnold <at> skeeve.com
To: jim <at> meyering.net, eggert <at> cs.ucla.edu
Cc: 16232 <at> debbugs.gnu.org
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster
 (sometimes 10x) in multibyte locales
Date: Thu, 20 Feb 2014 10:22:49 -0700

Jim Meyering <jim <at> meyering.net> wrote:

> Hi Paul,
>
> In case your bug fix looks safe/small, and is ready, ...
>
> I'm hoping to release 2.18 today, with the attached commits.
> Changes since yesterday: comment/log tweaks, and I've hoisted the using_utf8
> test in trivial_case_ignore to precede the two memchr tests.

Hi Jim.

Why copy the using_utf8() routine out of dfa.c?  Why not just link
to it instead?  If it's static, make it extern... That way if the
logic ever changes then it only has to be changed in one place.

Just a thought. :-)

Arnold

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Thu, 20 Feb 2014 18:36:02 GMT) Full text and rfc822 format available.

Message #80 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>
Cc: 16232 <16232 <at> debbugs.gnu.org>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes
 10x) in multibyte locales
Date: Thu, 20 Feb 2014 10:34:48 -0800

[Message part 1 (text/plain, inline)]

On 02/20/2014 09:13 AM, Jim Meyering wrote:
> In case your bug fix looks safe/small, and is ready, ...
Attached.  I have some other fixes too, which I'll try to get out the 
door today (though I can't promise that).

[0001-tests-test-in-unibyte-locales.patch (text/x-patch, attachment)]

[0002-grep-fix-bug-with-patterns-like-in-unibyte-locales.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Thu, 20 Feb 2014 20:43:02 GMT) Full text and rfc822 format available.

Message #83 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Aharon Robbins <arnold <at> skeeve.com>
Cc: 16232 <16232 <at> debbugs.gnu.org>, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster
 (sometimes 10x) in multibyte locales
Date: Thu, 20 Feb 2014 12:42:06 -0800

[Message part 1 (text/plain, inline)]

On Thu, Feb 20, 2014 at 9:22 AM,  <arnold <at> skeeve.com> wrote:
> Hi Jim.
>
> Why copy the using_utf8() routine out of dfa.c?  Why not just link
> to it instead?  If it's static, make it extern... That way if the
> logic ever changes then it only has to be changed in one place.

Hi Arnold,

That was due to my reflex of avoiding unnecessary change to dfa.c,
but in this case, it is definitely better to do as you suggest, not
just to avoid code duplication, but also for run-time efficiency:
with two copies of the function, there would have been two calls to
nl_langinfo per run; with only that one copy, we save a call, too.

Revised commits attached.

Thanks,
Jim

[k.txt (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Fri, 21 Feb 2014 00:42:02 GMT) Full text and rfc822 format available.

Message #86 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 16232 <16232 <at> debbugs.gnu.org>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster
 (sometimes 10x) in multibyte locales
Date: Thu, 20 Feb 2014 16:40:58 -0800

On Thu, Feb 20, 2014 at 10:34 AM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 02/20/2014 09:13 AM, Jim Meyering wrote:
>>
>> In case your bug fix looks safe/small, and is ready, ...
>
> Attached.  I have some other fixes too, which I'll try to get out the door
> today (though I can't promise that).

Thank you, Paul.  Those looked perfect.
I reversed the order, putting the otherwise-failing test after its fix, and
pushed both.

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Fri, 21 Feb 2014 00:46:02 GMT) Full text and rfc822 format available.

Message #89 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Aharon Robbins <arnold <at> skeeve.com>
Cc: 16232 <16232 <at> debbugs.gnu.org>, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster
 (sometimes 10x) in multibyte locales
Date: Thu, 20 Feb 2014 16:45:20 -0800

[Message part 1 (text/plain, inline)]

On Thu, Feb 20, 2014 at 12:42 PM, Jim Meyering <jim <at> meyering.net> wrote:
> Revised commits attached.

One more revision.  This is a big enough deal (and subtle enough) that
I thought I really should add a test for it.  Did that, so now it's 3
commits:

I'm going to push these in an hour or so, then see if I have time to
make the release this evening.

[k.txt (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16232; Package grep. (Fri, 21 Feb 2014 00:56:02 GMT) Full text and rfc822 format available.

Message #92 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Aharon Robbins <arnold <at> skeeve.com>
Cc: 16232 <16232 <at> debbugs.gnu.org>, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster
 (sometimes 10x) in multibyte locales
Date: Thu, 20 Feb 2014 16:54:54 -0800

PS. if anyone has really slow hardware on which they can
exercise this test, I'd appreciate your letting me know if you can make
this new test fail.  I chose the "4-second" limit presuming that there are
few systems slow enough that they'd take longer than that.

Technically, you could probably test with grep-2.16, if that's easier
(but not 2.17).  Run these commands:

  grep --version
  yes $(printf '%078d' 0)|head -50000 > k
  env LC_ALL=ja_JP.eucJP time grep -i foobar k

[use src/grep in place of "grep" above, if you're testing just-build grep
 -- but in that case, you can just run "make check"]

and let me know if it takes longer than 4 seconds.
Ideally, I would compare timings of two commands, one with the offending
locale, and the other with LC_ALL=C.  That would eliminate the scaling issue.

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Sat, 08 Mar 2014 02:49:02 GMT) Full text and rfc822 format available.

Notification sent to Jim Meyering <jim <at> meyering.net>:
bug acknowledged by developer. (Sat, 08 Mar 2014 02:49:02 GMT) Full text and rfc822 format available.

Message #97 received at 16232-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: 16232-done <at> debbugs.gnu.org
Subject: Re: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x)
 in multibyte locales
Date: Fri, 07 Mar 2014 18:48:23 -0800

This patch has been installed, so I'm taking the liberty of closing the 
bug report.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 05 Apr 2014 11:24:03 GMT) Full text and rfc822 format available.

This bug report was last modified 11 years and 82 days ago.

Previous Next

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #16232 [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales

GNU bug report logs - #16232
[PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales