GNU bug report logs - #16232
[PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales
Reported by: Jim Meyering <jim <at> meyering.net>
Date: Mon, 23 Dec 2013 22:40:02 UTC
Severity: normal
Tags: patch
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
Message #29 received at 16232 <at> debbugs.gnu.org (full text, mbox):
On Fri, Jan 10, 2014 at 5:49 PM, Pádraig Brady <P <at> draigbrady.com> wrote:
> Cool, so it does this transformation:
>
> sed 's/./[\L&\U&]/g'
>
> Though multibyte case handling has all sorts of edge cases (pardon the pun),
> and it may not always be valid to treat each character independently?
> For example see some of the tests in:
> http://git.sv.gnu.org/gitweb/?p=gnulib.git;a=blob;f=tests/unicase/test-ulc-casecmp.c;hb=HEAD
It seems you're right. Since case mapping is many-to-one in some
cases, a bracket expression holding just one lowercase and one
uppercase form of each character won't cover all possibilities.
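To make the many-to-one point concrete, here is a small C sketch. It is not grep's code: simple_fold, pair_matches, fold_matches, and the two-entry table are hypothetical names chosen for illustration. The table holds only two well-known simple case foldings from Unicode (U+017F folds to 's', U+212A folds to 'k'). It compares what a two-element bracket expression would accept against what a fold-based comparison would accept.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative table (an assumption of this sketch, not grep data):
   two non-ASCII code points whose simple case folding is an ASCII letter. */
struct fold { unsigned int from, to; };
static const struct fold extra_folds[] = {
  { 0x017F, 's' },  /* LATIN SMALL LETTER LONG S */
  { 0x212A, 'k' },  /* KELVIN SIGN */
};

/* Hypothetical helper: folds ASCII uppercase and the table entries only. */
static unsigned int simple_fold(unsigned int c)
{
  if (c >= 'A' && c <= 'Z')
    return c - 'A' + 'a';
  for (size_t i = 0; i < sizeof extra_folds / sizeof *extra_folds; i++)
    if (extra_folds[i].from == c)
      return extra_folds[i].to;
  return c;
}

/* Would the bracket expression [lower upper] built from pattern
   character p match text character t? */
static bool pair_matches(unsigned int p, unsigned int t)
{
  unsigned int lo = simple_fold(p);
  unsigned int up = (lo >= 'a' && lo <= 'z') ? lo - 'a' + 'A' : lo;
  return t == lo || t == up;
}

/* Would a comparison that folds both sides match? */
static bool fold_matches(unsigned int p, unsigned int t)
{
  return simple_fold(p) == simple_fold(t);
}
```

With these definitions, fold_matches('k', 0x212A) is true while pair_matches('k', 0x212A) is false: the pair [kK] misses a third character that folds to the same letter.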
> I wonder might this faster path be restricted to a safer but very common input subset of:
>
> (MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80))
That sounds like a good approach.
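A rough sketch of how that restriction might look in C follows. It is only an illustration under stated assumptions, not the code in the patch: safe_to_expand and expand_case are hypothetical names, mb_cur_max and in_utf8 are passed as parameters instead of being read from the locale, only letters are bracketed, and the pattern is treated as a plain string (regex metacharacters, existing bracket expressions, and backslash escapes are not handled).

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-byte form of the quoted check:
   MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80) */
static bool safe_to_expand(unsigned char c, size_t mb_cur_max, bool in_utf8)
{
  return mb_cur_max == 1 || (in_utf8 && c < 0x80);
}

/* Hypothetical helper: rewrite each letter x of PAT as "[xX]" into OUT.
   Returns the output length, or -1 if a byte falls outside the safe
   subset or OUT is too small; a caller would then use the slower path. */
static long expand_case(const char *pat, size_t mb_cur_max, bool in_utf8,
                        char *out, size_t outsz)
{
  size_t n = 0;
  if (outsz == 0)
    return -1;
  for (; *pat; pat++) {
    unsigned char c = (unsigned char) *pat;
    if (!safe_to_expand(c, mb_cur_max, in_utf8))
      return -1;
    int lo = tolower(c), up = toupper(c);
    if (lo != up) {
      /* Need room for four bytes plus the terminating NUL. */
      if (n + 4 >= outsz)
        return -1;
      out[n++] = '[';
      out[n++] = (char) lo;
      out[n++] = (char) up;
      out[n++] = ']';
    } else {
      /* Caseless byte: copy it through unchanged. */
      if (n + 1 >= outsz)
        return -1;
      out[n++] = (char) c;
    }
  }
  out[n] = '\0';
  return (long) n;
}
```

In the default "C" locale, calling expand_case("aB1", 1, false, buf, sizeof buf) with a 64-byte buf fills it with "[aA][bB]1". A UTF-8 pattern containing any byte at or above 0x80 makes the function return -1, which is the point where the fast path would be skipped.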
Now I need another test case, to demonstrate that the current code can
cause trouble.
> Also are the following printfs in the test redundant?
>
>> +data=$( printf "I:$I $i:i")
>> +search_str=$(printf "$i:i I:$I")
Good catch. Those were vestiges of pre-factoring code, where they
were needed. Here's the patch to fix that part, in your name:
[k.txt (text/plain, attachment)]
This bug report was last modified 11 years and 82 days ago.