GNU bug report logs - #16232
[PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales

Package: grep;

Reported by: Jim Meyering <jim <at> meyering.net>

Date: Mon, 23 Dec 2013 22:40:02 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.


Message #32 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Pádraig Brady <P <at> draigbrady.com>
Cc: 16232 <16232 <at> debbugs.gnu.org>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster
 (sometimes 10x) in multibyte locales
Date: Fri, 10 Jan 2014 21:40:24 -0800

On Fri, Jan 10, 2014 at 8:52 PM, Jim Meyering <jim <at> meyering.net> wrote:
>> I wonder might this faster path be restricted to a safer but very common input subset of:
>>
>> (MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80))
>
> That sounds like a good approach.
> Now I need another test case, to demonstrate that the current code can
> cause trouble.
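
For readers who want to experiment with the quoted condition outside of grep, here is a minimal sketch of it as a standalone predicate. The function name and its three parameters are hypothetical stand-ins for MB_CUR_MAX, the in_utf8 flag, and the byte at *c; the byte is assumed to be read as an unsigned value in the range 0..255. This is not grep's code.

```python
def fast_path_applies(mb_cur_max: int, in_utf8: bool, first_byte: int) -> bool:
    """Sketch of: (MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80)).

    All names here are illustrative assumptions, not grep identifiers.
    first_byte is assumed to be an unsigned byte value (0..255).
    """
    # Unibyte locale: every byte is a complete character.
    if mb_cur_max == 1:
        return True
    # UTF-8 locale: bytes below 0x80 are ASCII and never part of a
    # multibyte sequence, so they can be handled one byte at a time.
    return in_utf8 and first_byte < 0x80


if __name__ == "__main__":
    # 0x6A is ASCII 'j'; 0xC7 is the lead byte of a two-byte UTF-8 sequence.
    for args in [(1, False, 0xC7), (6, True, 0x6A), (6, True, 0xC7)]:
        print(args, fast_path_applies(*args))
```

Under this restriction, a character such as the one starting with byte 0xC7 would fall back to the slower path in a UTF-8 locale.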

Hmm... after thinking about this for a while and actually trying to
break the current code (did not find a way to demonstrate a regression),
I have concluded that the current approach is no worse than the prior
one of matching a case-mapped regexp vs. each case-mapped input line.

That's not to say that it's perfect, of course.
The "LATIN SMALL LETTER J WITH CARON, COMBINING DOT BELOW" case
from gnulib's test-ulc-casecmp.c illustrates the problem. This matches:

    printf '\x6A\xCC\x8C\xCC\xA3\n'|src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"

but this does not, yet probably should:

    printf '\xC7\xB0\xCC\xA3\n'|src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"
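
To see why the second input arguably should match, it may help to decode the two byte strings. The sketch below uses Python's standard unicodedata module, not grep or gnulib, and the helper names are made up for illustration. It shows that the pattern and the input are different code point sequences that become identical after canonical decomposition (NFD).

```python
import unicodedata

# Pattern from both commands: j, COMBINING CARON, COMBINING DOT BELOW
pattern = b"\x6A\xCC\x8C\xCC\xA3".decode("utf-8")
# Input of the non-matching command: precomposed J WITH CARON, COMBINING DOT BELOW
line = b"\xC7\xB0\xCC\xA3".decode("utf-8")


def code_points(s: str) -> list[str]:
    """List the code points of s in U+XXXX notation (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in s]


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    print(code_points(pattern))
    print(code_points(line))
    print("identical code points:", pattern == line)
    print("canonically equal:    ", canonically_equal(pattern, line))
```

A matcher that compares code points, or case-maps them one at a time, sees U+01F0 and U+006A U+030C as unrelated, which is consistent with the non-match reported above.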

Can you see a way to demonstrate a regression?

Thanks again,

Jim





