GNU bug report logs -
#16232
[PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales
Previous Next
Reported by: Jim Meyering <jim <at> meyering.net>
Date: Mon, 23 Dec 2013 22:40:02 UTC
Severity: normal
Tags: patch
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 16232 in the body.
You can then email your comments to 16232 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Mon, 23 Dec 2013 22:40:03 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
Jim Meyering <jim <at> meyering.net>
:
New bug report received and forwarded. Copy sent to
bug-grep <at> gnu.org
.
(Mon, 23 Dec 2013 22:40:03 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
FYI, here is a quick and clean/safe performance improvement for grep -i.
I expect to push this commit right after the upcoming bug-fix release.
Currently, this optimization is enabled when the search string is
ASCII and contains neither of '\' (backslash) nor '['. I expect to
eliminate the latter two constraints in a follow-on commit including
tests to exercise all of the corner cases.
Happy holidays,
Jim
[k.txt (text/plain, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Mon, 23 Dec 2013 22:53:02 GMT)
Full text and
rfc822 format available.
Message #8 received at 16232 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On 12/23/2013 03:39 PM, Jim Meyering wrote:
> FYI, here is a quick and clean/safe performance improvement for grep -i.
> I expect to push this commit right after the upcoming bug-fix release.
> Currently, this optimization is enabled when the search string is
> ASCII and contains neither of '\' (backslash) nor '['. I expect to
> eliminate the latter two constraints in a follow-on commit including
> tests to exercise all of the corner cases.
>
> +
> + /* Worst case is that every byte of keys will be alpha,
> + so every byte B will map to the sequence of 4 bytes [Bb]. */
Umm, is this always true? Consider the UTF-8 Turkish locale, where
single-byte i has a multi-byte uppercase (and conversely the single-byte
I has a multi-byte lowercase) - that is, 'i' and 'I' are not case pairs.
> + else
> + {
> + *p++ = '[';
> + *p++ = *keys;
> + *p++ = islower (*keys) ? toupper (*keys) : tolower (*keys);
This performs the ASCII-only toupper/tolower, rather than the proper
locale-sensitive case conversion, which probably makes this patch
misbehave for LC_ALL=tr_TR.UTF-8.
>
> + /* As currently implemented, case-insensitive matching is expensive in
> + multi-byte locales because of a few outlier locales in which some
> + characters change size when converted to upper or lower case. To
> + accommodate those, we revert to searching the input one line at a
> + time, rather than using the much more efficient buffer search.
> + However, if we have a plain ascii search string, /foo/, we can
> + convert it to an equivalent case-insensitive /[fF][oO][oO]/, and thus
> + avoid the expensive read-and-process-a-line-at-a-time requirement.
> + Optimize-away the "-i" option, when possible, converting each
> + candidate alpha, C, in the regexp to [Cc]. */
In other words, this comment describes the very flaw of the outlier
tr_TR locale that you are now violating with your optimization.
Without your patch, this is correct behavior:
$ echo i | LC_ALL=tr_TR.UTF-8 grep -i I
$ echo i | LC_ALL=tr_TR.UTF-8 grep -i i
i
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
[signature.asc (application/pgp-signature, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Mon, 23 Dec 2013 23:13:01 GMT)
Full text and
rfc822 format available.
Message #11 received at 16232 <at> debbugs.gnu.org (full text, mbox):
On Mon, Dec 23, 2013 at 2:52 PM, Eric Blake <eblake <at> redhat.com> wrote:
> On 12/23/2013 03:39 PM, Jim Meyering wrote:
>> FYI, here is a quick and clean/safe performance improvement for grep -i.
>> I expect to push this commit right after the upcoming bug-fix release.
>> Currently, this optimization is enabled when the search string is
>> ASCII and contains neither of '\' (backslash) nor '['. I expect to
>> eliminate the latter two constraints in a follow-on commit including
>> tests to exercise all of the corner cases.
>>
>
>> +
>> + /* Worst case is that every byte of keys will be alpha,
>> + so every byte B will map to the sequence of 4 bytes [Bb]. */
>
> Umm, is this always true? Consider the UTF-8 Turkish locale, where
Hi Eric,
Thanks for the review.
Did you miss the "isascii" check in the new trivial_case_convert function?
If you can describe circumstances in which the new patch malfunctions,
please do,
but everything you wrote seems to rely on a false assumption.
E.g., your turkish-I example works fine with my patch.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Mon, 23 Dec 2013 23:31:02 GMT)
Full text and
rfc822 format available.
Message #14 received at 16232 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On 12/23/2013 04:12 PM, Jim Meyering wrote:
> Did you miss the "isascii" check in the new trivial_case_convert function?
No. But even with that check in place:
> If you can describe circumstances in which the new patch malfunctions,
> please do,
> but everything you wrote seems to rely on a false assumption.
No, it's a quite real complaint - your patch is broken for tr_TR.
> E.g., your turkish-I example works fine with my patch.
isascii('i') is true, but converting 'i' to '[iI]' is incorrect in the
tr_TR locale. Rather, the conversion must be to '[iİ]'; similarly, 'I'
would be translated to '[Iı]'. Neither of those conversions fit in 4
bytes (since dotted-capital-I and dotless-lower-i are both multi-byte
characters).
Need help easily finding those characters on a non-Turkish keyboard? I
used:
$ echo iI | LC_ALL=tr_TR.UTF-8 sed 's/\(.\)\(.\)/\U\1\L\2/'
At any rate, prior to your patch, lower dotless i in the buffer gives an
insensitive match to upper dotless I in the pattern:
$ echo ı | LC_ALL=tr_TR.UTF-8 grep -i I || echo no match
ı
After your patch:
$ echo ı | LC_ALL=tr_TR.UTF-8 src/grep -i I || echo no match
no match
Oops, you failed to match lower dotless i insensitively against upper
dotless I, because upper dotless I is ascii, but you incorrectly
converted it into the wrong pattern.
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
[signature.asc (application/pgp-signature, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Mon, 23 Dec 2013 23:55:03 GMT)
Full text and
rfc822 format available.
Message #17 received at 16232 <at> debbugs.gnu.org (full text, mbox):
On Mon, Dec 23, 2013 at 3:30 PM, Eric Blake <eblake <at> redhat.com> wrote:
> On 12/23/2013 04:12 PM, Jim Meyering wrote:
>> Did you miss the "isascii" check in the new trivial_case_convert function?
>
> No. But even with that check in place:
>
>> If you can describe circumstances in which the new patch malfunctions,
>> please do,
>> but everything you wrote seems to rely on a false assumption.
>
> No, it's a quite real complaint - your patch is broken for tr_TR.
>
>> E.g., your turkish-I example works fine with my patch.
>
> isascii('i') is true, but converting 'i' to '[iI]' is incorrect in the
> tr_TR locale. Rather, the conversion must be to '[iİ]'; similarly, 'I'
> would be translated to '[Iı]'. Neither of those conversions fit in 4
> bytes (since dotted-capital-I and dotless-lower-i are both multi-byte
> characters).
>
> Need help easily finding those characters on a non-Turkish keyboard? I
> used:
> $ echo iI | LC_ALL=tr_TR.UTF-8 sed 's/\(.\)\(.\)/\U\1\L\2/'
>
> At any rate, prior to your patch, lower dotless i in the buffer gives an
> insensitive match to upper dotless I in the pattern:
>
> $ echo ı | LC_ALL=tr_TR.UTF-8 grep -i I || echo no match
> ı
>
> After your patch:
>
> $ echo ı | LC_ALL=tr_TR.UTF-8 src/grep -i I || echo no match
> no match
>
> Oops, you failed to match lower dotless i insensitively against upper
> dotless I, because upper dotless I is ascii, but you incorrectly
> converted it into the wrong pattern.
Thanks for dotting those 'i's. While there is no risk of buffer
overrun, there would definitely be a problem with the tr_TR locale.
I will resolve it by removing the isascii check and performing
multibyte case conversion to form each [cC] pair. Of course,
that will mean removing the "4 * byte-length-of-search-string"
buffer size limitation.
I will also add tests based on your examples.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Wed, 08 Jan 2014 03:57:02 GMT)
Full text and
rfc822 format available.
Message #20 received at 16232 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On Mon, Dec 23, 2013 at 3:54 PM, Jim Meyering <jim <at> meyering.net> wrote:
...
> Thanks for dotting those 'i's. While there is no risk of buffer
> overrun, there would definitely be a problem with the tr_TR locale.
> I will resolve it by removing the isascii check and performing
> multibyte case conversion to form each [cC] pair. Of course,
> that will mean removing the "4 * byte-length-of-search-string"
> buffer size limitation.
>
> I will also add tests based on your examples.
Here is the improved patch.
[k.txt (text/plain, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Fri, 10 Jan 2014 05:20:01 GMT)
Full text and
rfc822 format available.
Message #23 received at 16232 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On Tue, Jan 7, 2014 at 7:56 PM, Jim Meyering <jim <at> meyering.net> wrote:
> Here is the improved patch.
I've added a NEWS entry. Pushing tomorrow:
[k.txt (text/plain, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Sat, 11 Jan 2014 01:50:01 GMT)
Full text and
rfc822 format available.
Message #26 received at 16232 <at> debbugs.gnu.org (full text, mbox):
Cool so it does this transformation:
sed 's/./[\L&\U&]/g'
Though multi byte case handling has all sorts of edge cases (pardon the pun),
and it may not be always valid to treat each character independently?
For example see some of the tests in:
http://git.sv.gnu.org/gitweb/?p=gnulib.git;a=blob;f=tests/unicase/test-ulc-casecmp.c;hb=HEAD
I wonder might this faster path be restricted to a safer but very common input subset of:
(MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80))
Also are the following printfs in the test redundant?
> +data=$( printf "I:$I $i:i")
> +search_str=$(printf "$i:i I:$I")
nice improvement!
Pádraig.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Sat, 11 Jan 2014 04:53:01 GMT)
Full text and
rfc822 format available.
Message #29 received at 16232 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On Fri, Jan 10, 2014 at 5:49 PM, Pádraig Brady <P <at> draigbrady.com> wrote:
> Cool so it does this transformation:
>
> sed 's/./[\L&\U&]/g'
>
> Though multi byte case handling has all sorts of edge cases (pardon the pun),
> and it may not be always valid to treat each character independently?
> For example see some of the tests in:
> http://git.sv.gnu.org/gitweb/?p=gnulib.git;a=blob;f=tests/unicase/test-ulc-casecmp.c;hb=HEAD
It seems you're right. Since it's a many-to-one mapping in some
cases, simply using one lower case character and one upper case
version won't cover all possibilities.
> I wonder might this faster path be restricted to a safer but very common input subset of:
>
> (MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80))
That sounds like a good approach.
Now I need another test case, to demonstrate that the current code can
cause trouble.
> Also are the following printfs in the test redundant?
>
>> +data=$( printf "I:$I $i:i")
>> +search_str=$(printf "$i:i I:$I")
Good catch. Those were vestiges of pre-factoring code, where they
were needed. Here's the patch to fix that part, in your name:
[k.txt (text/plain, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Sat, 11 Jan 2014 05:41:01 GMT)
Full text and
rfc822 format available.
Message #32 received at 16232 <at> debbugs.gnu.org (full text, mbox):
On Fri, Jan 10, 2014 at 8:52 PM, Jim Meyering <jim <at> meyering.net> wrote:
>> I wonder might this faster path be restricted to a safer but very common input subset of:
>>
>> (MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80))
>
> That sounds like a good approach.
> Now I need another test case, to demonstrate that the current code can
> cause trouble.
Hmm... after thinking about this for a while and actually trying to
break the current code (did not find a way to demonstrate a regression),
I have concluded that the current approach is no worse than the prior
one of matching a case-mapped regexp vs. each case-mapped input line.
That's not to say that it's perfect, of course.
The "LATIN SMALL LETTER J WITH CARON, COMBINING DOT BELOW" example
from gnulib's test-ulc-casecmp.c is a great example: this matches:
printf '\x6A\xCC\x8C\xCC\xA3\n'|src/grep -i "$(printf
'\x6A\xCC\x8C\xCC\xA3')"
but this does not, yet probably should:
printf '\xC7\xB0\xCC\xA3\n'|src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"
Can you see a way to demonstrate a regression?
Thanks again,
Jim
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Sat, 11 Jan 2014 11:34:02 GMT)
Full text and
rfc822 format available.
Message #35 received at 16232 <at> debbugs.gnu.org (full text, mbox):
On 01/11/2014 05:40 AM, Jim Meyering wrote:
> On Fri, Jan 10, 2014 at 8:52 PM, Jim Meyering <jim <at> meyering.net> wrote:
>>> I wonder might this faster path be restricted to a safer but very common input subset of:
>>>
>>> (MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80))
>>
>> That sounds like a good approach.
>> Now I need another test case, to demonstrate that the current code can
>> cause trouble.
>
> Hmm... after thinking about this for a while and actually trying to
> break the current code (did not find a way to demonstrate a regression),
> I have concluded that the current approach is no worse than the prior
> one of matching a case-mapped regexp vs. each case-mapped input line.
>
> That's not to say that it's perfect, of course.
> The "LATIN SMALL LETTER J WITH CARON, COMBINING DOT BELOW" example
> from gnulib's test-ulc-casecmp.c is a great example: this matches:
>
> printf '\x6A\xCC\x8C\xCC\xA3\n'|src/grep -i "$(printf
> '\x6A\xCC\x8C\xCC\xA3')"
>
> but this does not, yet probably should:
>
> printf '\xC7\xB0\xCC\xA3\n'|src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"
>
> Can you see a way to demonstrate a regression?
Oh right, it doesn't handle these cases already.
Fair enough I don't see a regression then.
+1
Pádraig.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Sat, 11 Jan 2014 14:17:01 GMT)
Full text and
rfc822 format available.
Message #38 received at 16232 <at> debbugs.gnu.org (full text, mbox):
On 01/11/2014 11:33 AM, Pádraig Brady wrote:
> On 01/11/2014 05:40 AM, Jim Meyering wrote:
>> On Fri, Jan 10, 2014 at 8:52 PM, Jim Meyering <jim <at> meyering.net> wrote:
>>>> I wonder might this faster path be restricted to a safer but very common input subset of:
>>>>
>>>> (MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80))
>>>
>>> That sounds like a good approach.
>>> Now I need another test case, to demonstrate that the current code can
>>> cause trouble.
>>
>> Hmm... after thinking about this for a while and actually trying to
>> break the current code (did not find a way to demonstrate a regression),
>> I have concluded that the current approach is no worse than the prior
>> one of matching a case-mapped regexp vs. each case-mapped input line.
>>
>> That's not to say that it's perfect, of course.
>> The "LATIN SMALL LETTER J WITH CARON, COMBINING DOT BELOW" example
>> from gnulib's test-ulc-casecmp.c is a great example: this matches:
>>
>> printf '\x6A\xCC\x8C\xCC\xA3\n'|src/grep -i "$(printf
>> '\x6A\xCC\x8C\xCC\xA3')"
>>
>> but this does not, yet probably should:
>>
>> printf '\xC7\xB0\xCC\xA3\n'|src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"
>>
>> Can you see a way to demonstrate a regression?
>
> Oh right, it doesn't handle these cases already.
> Fair enough I don't see a regression then.
This is also a good summary of stuff to consider with case:
http://www.unicode.org/faq/casemap_charprop.html
So picking another case situation from there:
"in the Greek script, capital sigma (U+03A3) is the uppercase form of both
the regular (U+03C2) and final (U+03C3) lowercase sigma."
One can see that sed handles this:
$ printf '\u03C2\u03C3\n' | sed 's/.*/&\U&/'
ςσΣΣ
$ printf '\u03A3\n' | sed 's/.*/&\L&/'
Σσ
Though I was surprised the grep (2.14) didn't match any combo of these
$ printf '\u03C2\u03C3\n' | grep -Fi "$(printf \u03A3)"
$ printf '\u03A3\n' | grep -Fi "$(printf \u03C2)"
$ printf '\u03A3\n' | grep -Fi "$(printf \u03C3)"
Not a regression of course.
cheers,
Pádraig.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Sat, 11 Jan 2014 17:58:01 GMT)
Full text and
rfc822 format available.
Message #41 received at 16232 <at> debbugs.gnu.org (full text, mbox):
On Sat, Jan 11, 2014 at 6:15 AM, Pádraig Brady <P <at> draigbrady.com> wrote:
> On 01/11/2014 11:33 AM, Pádraig Brady wrote:
>> On 01/11/2014 05:40 AM, Jim Meyering wrote:
>>> On Fri, Jan 10, 2014 at 8:52 PM, Jim Meyering <jim <at> meyering.net> wrote:
>>>>> I wonder might this faster path be restricted to a safer but very common input subset of:
>>>>>
>>>>> (MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80))
>>>>
>>>> That sounds like a good approach.
>>>> Now I need another test case, to demonstrate that the current code can
>>>> cause trouble.
>>>
>>> Hmm... after thinking about this for a while and actually trying to
>>> break the current code (did not find a way to demonstrate a regression),
>>> I have concluded that the current approach is no worse than the prior
>>> one of matching a case-mapped regexp vs. each case-mapped input line.
>>>
>>> That's not to say that it's perfect, of course.
>>> The "LATIN SMALL LETTER J WITH CARON, COMBINING DOT BELOW" example
>>> from gnulib's test-ulc-casecmp.c is a great example: this matches:
>>>
>>> printf '\x6A\xCC\x8C\xCC\xA3\n'|src/grep -i "$(printf
>>> '\x6A\xCC\x8C\xCC\xA3')"
>>>
>>> but this does not, yet probably should:
>>>
>>> printf '\xC7\xB0\xCC\xA3\n'|src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"
>>>
>>> Can you see a way to demonstrate a regression?
>>
>> Oh right, it doesn't handle these cases already.
>> Fair enough I don't see a regression then.
>
> This is also a good summary of stuff to consider with case:
> http://www.unicode.org/faq/casemap_charprop.html
>
> So picking another case situation from there:
> "in the Greek script, capital sigma (U+03A3) is the uppercase form of both
> the regular (U+03C2) and final (U+03C3) lowercase sigma."
>
> One can see that sed handles this:
> $ printf '\u03C2\u03C3\n' | sed 's/.*/&\U&/'
> ςσΣΣ
> $ printf '\u03A3\n' | sed 's/.*/&\L&/'
> Σσ
>
> Though I was surprised the grep (2.14) didn't match any combo of these
> $ printf '\u03C2\u03C3\n' | grep -Fi "$(printf \u03A3)"
> $ printf '\u03A3\n' | grep -Fi "$(printf \u03C2)"
> $ printf '\u03A3\n' | grep -Fi "$(printf \u03C3)"
>
> Not a regression of course.
Thank you for the reference and the fine examples.
I'll add the latter as a known-failing test.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Sun, 12 Jan 2014 04:37:01 GMT)
Full text and
rfc822 format available.
Message #44 received at 16232 <at> debbugs.gnu.org (full text, mbox):
On Sat, Jan 11, 2014 at 6:15 AM, Pádraig Brady <P <at> draigbrady.com> wrote:
> On 01/11/2014 11:33 AM, Pádraig Brady wrote:
...
> This is also a good summary of stuff to consider with case:
> http://www.unicode.org/faq/casemap_charprop.html
>
> So picking another case situation from there:
> "in the Greek script, capital sigma (U+03A3) is the uppercase form of both
> the regular (U+03C2) and final (U+03C3) lowercase sigma."
>
> One can see that sed handles this:
> $ printf '\u03C2\u03C3\n' | sed 's/.*/&\U&/'
> ςσΣΣ
> $ printf '\u03A3\n' | sed 's/.*/&\L&/'
> Σσ
>
> Though I was surprised the grep (2.14) didn't match any combo of these
> $ printf '\u03C2\u03C3\n' | grep -Fi "$(printf \u03A3)"
> $ printf '\u03A3\n' | grep -Fi "$(printf \u03C2)"
> $ printf '\u03A3\n' | grep -Fi "$(printf \u03C3)"
Actually, if you quote the argument to the latter printf, two of those
do match, both with -F and without:
$ printf '\u03C2\u03C3\n' | grep -Fi "$(printf '\u03A3')"
ςσ
$ printf '\u03A3\n' | grep -Fi "$(printf '\u03C2')"
$ printf '\u03A3\n' | grep -Fi "$(printf '\u03C3')"
Σ
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Sun, 12 Jan 2014 12:57:01 GMT)
Full text and
rfc822 format available.
Message #47 received at 16232 <at> debbugs.gnu.org (full text, mbox):
On 01/12/2014 04:36 AM, Jim Meyering wrote:
> On Sat, Jan 11, 2014 at 6:15 AM, Pádraig Brady <P <at> draigbrady.com> wrote:
>> On 01/11/2014 11:33 AM, Pádraig Brady wrote:
> ...
>> This is also a good summary of stuff to consider with case:
>> http://www.unicode.org/faq/casemap_charprop.html
>>
>> So picking another case situation from there:
>> "in the Greek script, capital sigma (U+03A3) is the uppercase form of both
>> the regular (U+03C2) and final (U+03C3) lowercase sigma."
>>
>> One can see that sed handles this:
>> $ printf '\u03C2\u03C3\n' | sed 's/.*/&\U&/'
>> ςσΣΣ
>> $ printf '\u03A3\n' | sed 's/.*/&\L&/'
>> Σσ
>>
>> Though I was surprised the grep (2.14) didn't match any combo of these
>> $ printf '\u03C2\u03C3\n' | grep -Fi "$(printf \u03A3)"
>> $ printf '\u03A3\n' | grep -Fi "$(printf \u03C2)"
>> $ printf '\u03A3\n' | grep -Fi "$(printf \u03C3)"
>
> Actually, if you quote the argument to the latter printf, two of those
> do match, both with -F and without:
>
> $ printf '\u03C2\u03C3\n' | grep -Fi "$(printf '\u03A3')"
> ςσ
> $ printf '\u03A3\n' | grep -Fi "$(printf '\u03C2')"
> $ printf '\u03A3\n' | grep -Fi "$(printf '\u03C3')"
> Σ
Oops right.
So that's still no regression with the new scheme
since grep is 1:1 here for Σ and σ.
thanks,
Pádraig.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Wed, 19 Feb 2014 14:23:01 GMT)
Full text and
rfc822 format available.
Message #50 received at 16232 <at> debbugs.gnu.org (full text, mbox):
Hi,
Slow down may be caused by the patch, because MBCSET is processed by not
DFA engine but regexp engine.
I tested performance on grep-2.17 and the version which the patch is reverted.
Latter is 100x faster.
yes $(printf '%078dm' 0)|head -10000 > in
grep-2.17 original:
$ for i in $(seq 10); do env LC_ALL=ja_JP.eucJP time src/grep -i n in; done
Command exited with non-zero status 1
5.92user 1.69system 0:07.73elapsed 98%CPU (0avgtext+0avgdata 3856maxresident)k
0inputs+0outputs (0major+422minor)pagefaults 0swaps
Command exited with non-zero status 1
5.59user 1.87system 0:07.58elapsed 98%CPU (0avgtext+0avgdata 3872maxresident)k
0inputs+0outputs (0major+423minor)pagefaults 0swaps
Command exited with non-zero status 1
6.06user 1.58system 0:07.81elapsed 97%CPU (0avgtext+0avgdata 3872maxresident)k
0inputs+0outputs (0major+423minor)pagefaults 0swaps
Command exited with non-zero status 1
5.73user 1.66system 0:07.52elapsed 98%CPU (0avgtext+0avgdata 3856maxresident)k
0inputs+0outputs (0major+422minor)pagefaults 0swaps
Command exited with non-zero status 1
6.42user 1.19system 0:07.86elapsed 96%CPU (0avgtext+0avgdata 3872maxresident)k
0inputs+0outputs (0major+423minor)pagefaults 0swaps
Command exited with non-zero status 1
6.15user 1.56system 0:08.34elapsed 92%CPU (0avgtext+0avgdata 3888maxresident)k
0inputs+0outputs (0major+424minor)pagefaults 0swaps
Command exited with non-zero status 1
6.97user 0.61system 0:07.77elapsed 97%CPU (0avgtext+0avgdata 3856maxresident)k
0inputs+0outputs (0major+422minor)pagefaults 0swaps
Command exited with non-zero status 1
7.00user 0.57system 0:07.71elapsed 98%CPU (0avgtext+0avgdata 3872maxresident)k
0inputs+0outputs (0major+423minor)pagefaults 0swaps
Command exited with non-zero status 1
7.16user 0.25system 0:07.56elapsed 97%CPU (0avgtext+0avgdata 3872maxresident)k
0inputs+0outputs (0major+423minor)pagefaults 0swaps
Command exited with non-zero status 1
7.04user 0.39system 0:07.60elapsed 97%CPU (0avgtext+0avgdata 3856maxresident)k
0inputs+0outputs (0major+422minor)pagefaults 0swaps
After revert the patch:
$ for i in $(seq 10); do env LC_ALL=ja_JP.eucJP time src/grep -i n in; done
Command exited with non-zero status 1
0.07user 0.02system 0:00.10elapsed 92%CPU (0avgtext+0avgdata 3072maxresident)k
0inputs+0outputs (0major+232minor)pagefaults 0swaps
Command exited with non-zero status 1
0.03user 0.01system 0:00.05elapsed 101%CPU (0avgtext+0avgdata 3072maxresident)k
0inputs+0outputs (0major+218minor)pagefaults 0swaps
Command exited with non-zero status 1
0.04user 0.01system 0:00.06elapsed 90%CPU (0avgtext+0avgdata 3072maxresident)k
0inputs+0outputs (0major+218minor)pagefaults 0swaps
Command exited with non-zero status 1
0.03user 0.02system 0:00.05elapsed 103%CPU (0avgtext+0avgdata 3056maxresident)k
0inputs+0outputs (0major+217minor)pagefaults 0swaps
Command exited with non-zero status 1
0.04user 0.01system 0:00.06elapsed 86%CPU (0avgtext+0avgdata 3088maxresident)k
0inputs+0outputs (0major+219minor)pagefaults 0swaps
Command exited with non-zero status 1
0.04user 0.01system 0:00.06elapsed 91%CPU (0avgtext+0avgdata 3056maxresident)k
0inputs+0outputs (0major+217minor)pagefaults 0swaps
Command exited with non-zero status 1
0.04user 0.01system 0:00.06elapsed 87%CPU (0avgtext+0avgdata 3056maxresident)k
0inputs+0outputs (0major+217minor)pagefaults 0swaps
Command exited with non-zero status 1
0.03user 0.02system 0:00.05elapsed 105%CPU (0avgtext+0avgdata 3072maxresident)k
0inputs+0outputs (0major+218minor)pagefaults 0swaps
Command exited with non-zero status 1
0.04user 0.00system 0:00.06elapsed 90%CPU (0avgtext+0avgdata 3072maxresident)k
0inputs+0outputs (0major+218minor)pagefaults 0swaps
Command exited with non-zero status 1
0.04user 0.01system 0:00.06elapsed 90%CPU (0avgtext+0avgdata 3056maxresident)k
0inputs+0outputs (0major+217minor)pagefaults 0swaps
Norihiro
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Wed, 19 Feb 2014 18:31:02 GMT)
Full text and
rfc822 format available.
Message #53 received at 16232 <at> debbugs.gnu.org (full text, mbox):
On 02/19/2014 06:22 AM, Norihiro Tanaka wrote:
> I tested performance on grep-2.17 and the version which the patch is reverted.
> Latter is 100x faster.
While we're on the topic, can someone please say why that patch was done
as a preprocessor for the regex code? I normally would think that the
way to speed up regex processing is to improve dfa.c and/or the regex
library, as that should speed up all programs that use these modules,
not just grep.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Wed, 19 Feb 2014 19:18:02 GMT)
Full text and
rfc822 format available.
Message #56 received at 16232 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On Wed, Feb 19, 2014 at 6:22 AM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> for i in $(seq 10); do env LC_ALL=ja_JP.eucJP time src/grep -i n in; done
Wow. You're right. With the attached patch, I see a speedup of more
than 130x in this case: (fyi, the "time" output is slightly different,
because I have installed GNU time)
grep-2.17$ for i in $(seq 5); do env LC_ALL=ja_JP.eucJP time grep -i n in; done
2.78 real 2.78 user 0.00 sys
2.73 real 2.73 user 0.00 sys
2.75 real 2.75 user 0.00 sys
2.73 real 2.73 user 0.00 sys
2.74 real 2.74 user 0.00 sys
2.17+patch$ for i in $(seq 5); do env LC_ALL=ja_JP.eucJP time src/grep
-i n in; done
0.02 real 0.02 user 0.00 sys
0.02 real 0.02 user 0.00 sys
0.02 real 0.02 user 0.00 sys
0.02 real 0.02 user 0.00 sys
0.02 real 0.02 user 0.00 sys
I haven't investigated they "why" yet, but expect that I will make
grep-2.18 with just this one performance-improving patch.
Thank you, Norihiro,
Jim
[k.txt (text/plain, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Wed, 19 Feb 2014 19:52:01 GMT)
Full text and
rfc822 format available.
Message #59 received at 16232 <at> debbugs.gnu.org (full text, mbox):
On Wed, Feb 19, 2014 at 10:30 AM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> While we're on the topic, can someone please say why that patch was done as
> a preprocessor for the regex code? I normally would think that the way to
> speed up regex processing is to improve dfa.c and/or the regex library, as
> that should speed up all programs that use these modules, not just grep.
Hi Paul,
My patch was a first crack at dealing with an inefficiency that was
grep-specific, so it made sense to apply as a grep-specific preprocessing
phase. Norihiro found an additional improvement, that, it so happens,
would have had an even bigger impact if my patch had not been applied first.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Wed, 19 Feb 2014 19:57:01 GMT)
Full text and
rfc822 format available.
Message #62 received at submit <at> debbugs.gnu.org (full text, mbox):
On 02/19/2014 11:17 AM, Jim Meyering wrote:
> I haven't investigated they "why" yet, but expect that I will make
> grep-2.18 with just this one performance-improving patch.
Can you make room for one more patch? I have a test case and bugfix for
grep's mishandling of regular expressions like [^^-~] in unibyte locales
like the C locale on most platforms; this is due to the range-expression
bug that Arnold reported recently.I can send out the patch later today
if you like. The bug has been present since grep 2.8 so I can
understand if you'd rather wait to fix it until later.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Wed, 19 Feb 2014 20:02:02 GMT)
Full text and
rfc822 format available.
Message #65 received at 16232 <at> debbugs.gnu.org (full text, mbox):
On Wed, Feb 19, 2014 at 11:55 AM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 02/19/2014 11:17 AM, Jim Meyering wrote:
>>
>> I haven't investigated they "why" yet, but expect that I will make
>> grep-2.18 with just this one performance-improving patch.
>
> Can you make room for one more patch? I have a test case and bugfix for
> grep's mishandling of regular expressions like [^^-~] in unibyte locales
> like the C locale on most platforms; this is due to the range-expression bug
> that Arnold reported recently.I can send out the patch later today if you
> like. The bug has been present since grep 2.8 so I can understand if you'd
> rather wait to fix it until later.
Hi Paul,
I'll be very happy to consider it.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Thu, 20 Feb 2014 03:46:01 GMT)
Full text and
rfc822 format available.
Message #68 received at 16232 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Hmm... it's not as clear-cut as I first thought.
(I built 2.17+ the above patch and put it in a directory named grep-2.18)
The following times 2.16, 2.17 and 2.17+patch two ways:
$ yes jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj | head -10000000 > k
$ for i in 16 17 18; do echo $i; env LC_ALL=en_US.UTF-8 time
/p/p/grep-2.$i/bin/grep -i foobar k; done
16
15.96 real 14.57 user 0.12 sys
17
1.13 real 1.07 user 0.06 sys
18
1.96 real 1.89 user 0.06 sys
The above search takes more than 70% longer with the proposed patch.
Contrast that with performance in the non-UTF8 ja_JP.eucJP locale:
$ yes $(printf '%078dm' 0)|head -10000 > in
$ for i in 16 17 18; do echo $i; env LC_ALL=ja_JP.eucJP time
/p/p/grep-2.$i/bin/grep -i n in; done
16
0.03 real 0.02 user 0.00 sys
17
2.98 real 2.96 user 0.00 sys
18
0.02 real 0.02 user 0.00 sys
Using the jjj+foobar example, but with only 100k lines, we see there
was a 200x performance regression going from grep-2.16 to 2.17:
$ yes jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj | head -100000 > k
$ for i in 16 17 18; do echo $i; env LC_ALL=ja_JP.eucJP time
/p/p/grep-2.$i/bin/grep -i foobar k; done
16
0.15 real 0.14 user 0.00 sys
17
27.74 real 27.72 user 0.01 sys
18
0.11 real 0.11 user 0.00 sys
Obviously, I want to retain all of 2.17's performance gain in UTF-8 locales,
while avoiding the 200x penalty in multi-byte non-UTF8 locales like ja_JP.eucJP.
So I have prepared a better patch.
With the two attached commits (on top of 2.17), I get these timings,
i.e., the same 200x improvement with ja_JP.eucJP, and no regression
with en_US.UTF8)
$ for i in 16 17 18; do printf "$i: "; env LC_ALL=ja_JP.eucJP time
/p/p/grep-2.$i/bin/grep -i foobar k; done
16: 0.14 real 0.14 user 0.00 sys
17: 27.97 real 27.95 user 0.01 sys
18: 0.12 real 0.12 user 0.00 sys
$ for i in 16 17 18; do printf "$i: "; env LC_ALL=en_US.UTF-8 time
/p/p/grep-2.$i/bin/grep -i foobar k; done
16: 0.13 real 0.12 user 0.00 sys
17: 0.01 real 0.01 user 0.00 sys
18: 0.01 real 0.01 user 0.00 sys
[k.txt (text/plain, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Thu, 20 Feb 2014 13:40:02 GMT)
Full text and
rfc822 format available.
Message #71 received at 16232 <at> debbugs.gnu.org (full text, mbox):
Hi Jim,
Your patch is probably right.
However, I think that the true cause for 100x slow is that DFA engine is
slower than regex engine for case-insensitive matching on a non-UTF-8
locle.
On a multibyte locale, for case-insensitive "a" grep prefers DFA engine,
but for character class "[Aa]" prefers regex engine.
Norihiro
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Thu, 20 Feb 2014 17:14:02 GMT)
Full text and
rfc822 format available.
Message #74 received at 16232 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Hi Paul,
In case your bug fix looks safe/small, and is ready, ...
I'm hoping to release 2.18 today, with the attached commits.
Changes since yesterday: comment/log tweaks, and I've hoisted the using_utf8
test in trivial_case_ignore to precede the two memchr tests.
[k.txt (text/plain, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Thu, 20 Feb 2014 17:25:02 GMT)
Full text and
rfc822 format available.
Message #77 received at 16232 <at> debbugs.gnu.org (full text, mbox):
Jim Meyering <jim <at> meyering.net> wrote:
> Hi Paul,
>
> In case your bug fix looks safe/small, and is ready, ...
>
> I'm hoping to release 2.18 today, with the attached commits.
> Changes since yesterday: comment/log tweaks, and I've hoisted the using_utf8
> test in trivial_case_ignore to precede the two memchr tests.
Hi Jim.
Why copy the using_utf8() routine out of dfa.c? Why not just link
to it instead? If it's static, make it extern... That way if the
logic ever changes then it only has to be changed in one place.
Just a thought. :-)
Arnold
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Thu, 20 Feb 2014 18:36:02 GMT)
Full text and
rfc822 format available.
Message #80 received at 16232 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On 02/20/2014 09:13 AM, Jim Meyering wrote:
> In case your bug fix looks safe/small, and is ready, ...
Attached. I have some other fixes too, which I'll try to get out the
door today (though I can't promise that).
[0001-tests-test-in-unibyte-locales.patch (text/x-patch, attachment)]
[0002-grep-fix-bug-with-patterns-like-in-unibyte-locales.patch (text/x-patch, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Thu, 20 Feb 2014 20:43:02 GMT)
Full text and
rfc822 format available.
Message #83 received at 16232 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On Thu, Feb 20, 2014 at 9:22 AM, <arnold <at> skeeve.com> wrote:
> Hi Jim.
>
> Why copy the using_utf8() routine out of dfa.c? Why not just link
> to it instead? If it's static, make it extern... That way if the
> logic ever changes then it only has to be changed in one place.
Hi Arnold,
That was due to my reflex of avoiding unnecessary change to dfa.c,
but in this case, it is definitely better to do as you suggest, not
just to avoid code duplication, but also for run-time efficiency:
with two copies of the function, there would have been two calls to
nl_langinfo per run; with only that one copy, we save a call, too.
Revised commits attached.
Thanks,
Jim
[k.txt (text/plain, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Fri, 21 Feb 2014 00:42:02 GMT)
Full text and
rfc822 format available.
Message #86 received at 16232 <at> debbugs.gnu.org (full text, mbox):
On Thu, Feb 20, 2014 at 10:34 AM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 02/20/2014 09:13 AM, Jim Meyering wrote:
>>
>> In case your bug fix looks safe/small, and is ready, ...
>
> Attached. I have some other fixes too, which I'll try to get out the door
> today (though I can't promise that).
Thank you, Paul. Those looked perfect.
I reversed the order, putting the otherwise-failing test after its fix, and
pushed both.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Fri, 21 Feb 2014 00:46:02 GMT)
Full text and
rfc822 format available.
Message #89 received at 16232 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On Thu, Feb 20, 2014 at 12:42 PM, Jim Meyering <jim <at> meyering.net> wrote:
> Revised commits attached.
One more revision. This is a big enough deal (and subtle enough) that
I thought I really should add a test for it. Did that, so now it's 3
commits:
I'm going to push these in an hour or so, then see if I have time to
make the release this evening.
[k.txt (text/plain, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16232
; Package
grep
.
(Fri, 21 Feb 2014 00:56:02 GMT)
Full text and
rfc822 format available.
Message #92 received at 16232 <at> debbugs.gnu.org (full text, mbox):
PS. if anyone has really slow hardware on which they can
exercise this test, I'd appreciate your letting me know if you can make
this new test fail. I chose the "4-second" limit presuming that there are
few systems slow enough that they'd take longer than that.
Technically, you could probably test with grep-2.16, if that's easier
(but not 2.17). Run these commands:
grep --version
yes $(printf '%078d' 0)|head -50000 > k
env LC_ALL=ja_JP.eucJP time grep -i foobar k
[use src/grep in place of "grep" above, if you're testing just-build grep
-- but in that case, you can just run "make check"]
and let me know if it takes longer than 4 seconds.
Ideally, I would compare timings of two commands, one with the offending
locale, and the other with LC_ALL=C. That would eliminate the scaling issue.
Reply sent
to
Paul Eggert <eggert <at> cs.ucla.edu>
:
You have taken responsibility.
(Sat, 08 Mar 2014 02:49:02 GMT)
Full text and
rfc822 format available.
Notification sent
to
Jim Meyering <jim <at> meyering.net>
:
bug acknowledged by developer.
(Sat, 08 Mar 2014 02:49:02 GMT)
Full text and
rfc822 format available.
Message #97 received at 16232-done <at> debbugs.gnu.org (full text, mbox):
This patch has been installed, so I'm taking the liberty of closing the
bug report.
bug archived.
Request was from
Debbugs Internal Request <help-debbugs <at> gnu.org>
to
internal_control <at> debbugs.gnu.org
.
(Sat, 05 Apr 2014 11:24:03 GMT)
Full text and
rfc822 format available.
This bug report was last modified 11 years and 82 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.