GNU bug report logs -
#16232
[PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales
Previous Next
Reported by: Jim Meyering <jim <at> meyering.net>
Date: Mon, 23 Dec 2013 22:40:02 UTC
Severity: normal
Tags: patch
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
Full log
View this message in rfc822 format
[Message part 1 (text/plain, inline)]
On 12/23/2013 04:12 PM, Jim Meyering wrote:
> Did you miss the "isascii" check in the new trivial_case_convert function?
No. But even with that check in place:
> If you can describe circumstances in which the new patch malfunctions,
> please do,
> but everything you wrote seems to rely on a false assumption.
No, it's a quite real complaint - your patch is broken for tr_TR.
> E.g., your turkish-I example works fine with my patch.
isascii('i') is true, but converting 'i' to '[iI]' is incorrect in the
tr_TR locale. Rather, the conversion must be to '[iİ]'; similarly, 'I'
would be translated to '[Iı]'. Neither of those conversions fit in 4
bytes (since dotted-capital-I and dotless-lower-i are both multi-byte
characters).
Need help easily finding those characters on a non-Turkish keyboard? I
used:
$ echo iI | LC_ALL=tr_TR.UTF-8 sed 's/\(.\)\(.\)/\U\1\L\2/'
At any rate, prior to your patch, lower dotless i in the buffer gives an
insensitive match to upper dotless I in the pattern:
$ echo ı | LC_ALL=tr_TR.UTF-8 grep -i I || echo no match
ı
After your patch:
$ echo ı | LC_ALL=tr_TR.UTF-8 src/grep -i I || echo no match
no match
Oops, you failed to match lower dotless i insensitively against upper
dotless I, because upper dotless I is ascii, but you incorrectly
converted it into the wrong pattern.
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
[signature.asc (application/pgp-signature, attachment)]
This bug report was last modified 11 years and 82 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.