#16232 - [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales

GNU bug report logs - #16232
[PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales

Package: grep;

Reported by: Jim Meyering <jim <at> meyering.net>

Date: Mon, 23 Dec 2013 22:40:02 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Message #8 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com> To: Jim Meyering <jim <at> meyering.net>, 16232 <at> debbugs.gnu.org Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales Date: Mon, 23 Dec 2013 15:52:26 -0700

[Message part 1 (text/plain, inline)]

On 12/23/2013 03:39 PM, Jim Meyering wrote: > FYI, here is a quick and clean/safe performance improvement for grep -i. > I expect to push this commit right after the upcoming bug-fix release. > Currently, this optimization is enabled when the search string is > ASCII and contains neither of '\' (backslash) nor '['. I expect to > eliminate the latter two constraints in a follow-on commit including > tests to exercise all of the corner cases. > > + > + /* Worst case is that every byte of keys will be alpha, > + so every byte B will map to the sequence of 4 bytes [Bb]. */ Umm, is this always true? Consider the UTF-8 Turkish locale, where single-byte i has a multi-byte uppercase (and conversely the single-byte I has a multi-byte lowercase) - that is, 'i' and 'I' are not case pairs. > + else > + { > + *p++ = '['; > + *p++ = *keys; > + *p++ = islower (*keys) ? toupper (*keys) : tolower (*keys); This performs the ASCII-only toupper/tolower, rather than the proper locale-sensitive case conversion, which probably makes this patch misbehave for LC_ALL=tr_TR.UTF-8. > > + /* As currently implemented, case-insensitive matching is expensive in > + multi-byte locales because of a few outlier locales in which some > + characters change size when converted to upper or lower case. To > + accommodate those, we revert to searching the input one line at a > + time, rather than using the much more efficient buffer search. > + However, if we have a plain ascii search string, /foo/, we can > + convert it to an equivalent case-insensitive /[fF][oO][oO]/, and thus > + avoid the expensive read-and-process-a-line-at-a-time requirement. > + Optimize-away the "-i" option, when possible, converting each > + candidate alpha, C, in the regexp to [Cc]. */ In other words, this comment describes the very flaw of the outlier tr_TR locale that you are now violating with your optimization. Without your patch, this is correct behavior: $ echo i | LC_ALL=tr_TR.UTF-8 grep -i I $ echo i | LC_ALL=tr_TR.UTF-8 grep -i i i -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

[signature.asc (application/pgp-signature, attachment)]

This bug report was last modified 11 years and 82 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #16232 [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales

GNU bug report logs - #16232
[PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales