GNU bug report logs -
#16232
[PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales
Reported by: Jim Meyering <jim <at> meyering.net>
Date: Mon, 23 Dec 2013 22:40:02 UTC
Severity: normal
Tags: patch
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
Message #38 received at 16232 <at> debbugs.gnu.org (full text, mbox):
On 01/11/2014 11:33 AM, Pádraig Brady wrote:
> On 01/11/2014 05:40 AM, Jim Meyering wrote:
>> On Fri, Jan 10, 2014 at 8:52 PM, Jim Meyering <jim <at> meyering.net> wrote:
>>>> I wonder might this faster path be restricted to a safer but very common input subset of:
>>>>
>>>> (MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80))
>>>
>>> That sounds like a good approach.
>>> Now I need another test case, to demonstrate that the current code can
>>> cause trouble.
>>
>> Hmm... after thinking about this for a while and actually trying to
>> break the current code (did not find a way to demonstrate a regression),
>> I have concluded that the current approach is no worse than the prior
>> one of matching a case-mapped regexp vs. each case-mapped input line.
>>
>> That's not to say that it's perfect, of course.
>> The "LATIN SMALL LETTER J WITH CARON, COMBINING DOT BELOW" example
>> from gnulib's test-ulc-casecmp.c is a great example: this matches:
>>
>> printf '\x6A\xCC\x8C\xCC\xA3\n'|src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"
>>
>> but this does not, yet probably should:
>>
>> printf '\xC7\xB0\xCC\xA3\n'|src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"
>>
>> Can you see a way to demonstrate a regression?
>
> Oh right, it doesn't handle these cases already.
> Fair enough I don't see a regression then.
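For reference, the restriction proposed in the quoted text, (MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80)), can be sketched as a standalone C predicate. This is only a minimal sketch: the function name fast_path_ok and its parameters are hypothetical and are not taken from grep's source, where MB_CUR_MAX and in_utf8 would come from the current locale rather than from arguments.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not grep's actual code): decide whether the byte
   at *c may take the single-byte case-insensitive fast path under the
   restriction proposed in the thread.  */
static bool
fast_path_ok (size_t mb_cur_max, bool in_utf8, const unsigned char *c)
{
  /* Unibyte locale: every character is exactly one byte.  */
  if (mb_cur_max == 1)
    return true;

  /* UTF-8 locale: a byte below 0x80 is always a complete ASCII
     character and never part of a multibyte sequence.  */
  return in_utf8 && *c < 0x80;
}
```

Under this sketch, a lead byte such as 0xC7 (the start of U+01F0 in UTF-8) would fall back to the slower multibyte path, while plain ASCII input would stay on the fast path.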
This is also a good summary of the issues to consider with case mapping:
http://www.unicode.org/faq/casemap_charprop.html
So, picking another case-mapping example from there:
"in the Greek script, capital sigma (U+03A3) is the uppercase form of both
the regular (U+03C3) and final (U+03C2) lowercase sigma."
One can see that sed handles this:
$ printf '\u03C2\u03C3\n' | sed 's/.*/&\U&/'
ςσΣΣ
$ printf '\u03A3\n' | sed 's/.*/&\L&/'
Σσ
Though I was surprised that grep (2.14) didn't match any combination of these:
$ printf '\u03C2\u03C3\n' | grep -Fi "$(printf \u03A3)"
$ printf '\u03A3\n' | grep -Fi "$(printf \u03C2)"
$ printf '\u03A3\n' | grep -Fi "$(printf \u03C3)"
Not a regression of course.
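To illustrate why the sigma case is awkward for any matcher, here is a minimal C sketch that compares code points by lowercasing both sides and by case folding both sides. It is not a claim about what grep 2.14 does internally. The helper names are hypothetical, and the tables cover only the three sigma code points (U+03A3 capital, U+03C3 regular small, U+03C2 final small), using the simple mappings from the Unicode data files.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simple lowercase mapping, restricted to the sigma code points:
   capital sigma lowers to regular small sigma; final sigma is
   already lowercase and stays as it is.  */
static uint32_t
sigma_tolower (uint32_t cp)
{
  return cp == 0x03A3 ? 0x03C3 : cp;
}

/* Simple case folding, restricted to the sigma code points:
   both capital sigma and final sigma fold to regular small sigma.  */
static uint32_t
sigma_casefold (uint32_t cp)
{
  return (cp == 0x03A3 || cp == 0x03C2) ? 0x03C3 : cp;
}

/* Case-insensitive equality by lowercasing both sides.  */
static bool
eq_by_lowercase (uint32_t a, uint32_t b)
{
  return sigma_tolower (a) == sigma_tolower (b);
}

/* Case-insensitive equality by case folding both sides.  */
static bool
eq_by_casefold (uint32_t a, uint32_t b)
{
  return sigma_casefold (a) == sigma_casefold (b);
}
```

With these tables, eq_by_lowercase (0x03A3, 0x03C2) is false even though U+03A3 is the uppercase of U+03C2, whereas eq_by_casefold treats all three sigmas as equal. A matcher built only on one-to-one lowercase mapping would therefore be expected to miss the final-sigma pairing.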
cheers,
Pádraig.
This bug report was last modified 11 years and 82 days ago.