GNU bug report logs - #16232
[PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales

Package: grep

Reported by: Jim Meyering <jim <at> meyering.net>

Date: Mon, 23 Dec 2013 22:40:02 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.


From: Jim Meyering <jim <at> meyering.net>
To: Eric Blake <eblake <at> redhat.com>
Cc: 16232 <16232 <at> debbugs.gnu.org>
Subject: bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales
Date: Mon, 23 Dec 2013 15:54:27 -0800

On Mon, Dec 23, 2013 at 3:30 PM, Eric Blake <eblake <at> redhat.com> wrote:
> On 12/23/2013 04:12 PM, Jim Meyering wrote:
>> Did you miss the "isascii" check in the new trivial_case_convert function?
>
> No.  But even with that check in place:
>
>> If you can describe circumstances in which the new patch malfunctions,
>> please do, but everything you wrote seems to rely on a false assumption.
>
> No, it's a quite real complaint - your patch is broken for tr_TR.
>
>> E.g., your turkish-I example works fine with my patch.
>
> isascii('i') is true, but converting 'i' to '[iI]' is incorrect in the
> tr_TR locale.  Rather, the conversion must be to '[iİ]'; similarly, 'I'
> would be translated to '[Iı]'.  Neither of those conversions fit in 4
> bytes (since dotted-capital-I and dotless-lower-i are both multi-byte
> characters).
>
> Need help easily finding those characters on a non-Turkish keyboard?  I
> used:
> $ echo iI | LC_ALL=tr_TR.UTF-8 sed 's/\(.\)\(.\)/\U\1\L\2/'
>
> At any rate, prior to your patch, lower dotless i in the buffer gives an
> insensitive match to upper dotless I in the pattern:
>
> $ echo ı | LC_ALL=tr_TR.UTF-8 grep -i I || echo no match
> ı
>
> After your patch:
>
> $ echo ı | LC_ALL=tr_TR.UTF-8 src/grep -i I || echo no match
> no match
>
> Oops, you failed to match lower dotless i insensitively against upper
> dotless I, because upper dotless I is ascii, but you incorrectly
> converted it into the wrong pattern.

Thanks for dotting those 'i's.  While there is no risk of buffer
overrun, there would definitely be a problem with the tr_TR locale.
I will resolve it by removing the isascii check and performing
multibyte case conversion to form each [cC] pair.  Of course,
that will mean removing the "4 * byte-length-of-search-string"
buffer size limitation.

I will also add tests based on your examples.
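
A minimal sketch of why the "4 * byte-length" bound cannot hold once the
conversion is locale-aware. It assumes a hard-coded Turkish case table
instead of real locale lookups, and treats every pattern character as a
literal. The names TURKISH_PAIRS, bracket_pair, and expand are hypothetical
and are not functions from grep:

```python
# Hypothetical case table for tr_TR, hard-coded for illustration only.
# Real code would query the C library under the active locale instead.
TURKISH_PAIRS = {
    "i": "\u0130",  # i -> dotted capital I
    "\u0130": "i",
    "I": "\u0131",  # I -> dotless small i
    "\u0131": "I",
}


def bracket_pair(ch, pairs):
    """Return a bracket expression matching ch and its case counterpart."""
    other = pairs.get(ch)
    if other is None:
        other = ch.swapcase()  # locale-independent fallback
    # No counterpart, or one that is not a single character: leave ch alone.
    if other == ch or len(other) != 1:
        return ch
    return "[" + ch + other + "]"


def expand(pattern, pairs):
    """Expand each literal character of pattern into its bracket pair."""
    return "".join(bracket_pair(ch, pairs) for ch in pattern)


if __name__ == "__main__":
    for ch in "iI":
        expr = bracket_pair(ch, TURKISH_PAIRS)
        # Compare the UTF-8 size of the input with the size of the expansion.
        print(ch, len(ch.encode("utf-8")), expr, len(expr.encode("utf-8")))
```

Under these assumptions each one-byte ASCII letter expands to a five-byte
bracket expression in UTF-8, because its Turkish counterpart occupies two
bytes. That is one byte more than the old limit allowed.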




This bug report was last modified 11 years and 82 days ago.
