#16232 - [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales


Package: grep;

Reported by: Jim Meyering <jim <at> meyering.net>

Date: Mon, 23 Dec 2013 22:40:02 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Message #38 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Jim Meyering <jim <at> meyering.net>
Cc: 16232 <16232 <at> debbugs.gnu.org>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales
Date: Sat, 11 Jan 2014 14:15:58 +0000

On 01/11/2014 11:33 AM, Pádraig Brady wrote:
> On 01/11/2014 05:40 AM, Jim Meyering wrote:
>> On Fri, Jan 10, 2014 at 8:52 PM, Jim Meyering <jim <at> meyering.net> wrote:
>>>> I wonder might this faster path be restricted to a safer but very common input subset of:
>>>>
>>>>   (MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80))
>>>
>>> That sounds like a good approach.
>>> Now I need another test case, to demonstrate that the current code can
>>> cause trouble.
>>
>> Hmm... after thinking about this for a while and actually trying to
>> break the current code (did not find a way to demonstrate a regression),
>> I have concluded that the current approach is no worse than the prior
>> one of matching a case-mapped regexp vs. each case-mapped input line.
>>
>> That's not to say that it's perfect, of course.
>> The "LATIN SMALL LETTER J WITH CARON, COMBINING DOT BELOW" example
>> from gnulib's test-ulc-casecmp.c is a great example: this matches:
>>
>>   printf '\x6A\xCC\x8C\xCC\xA3\n'|src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"
>>
>> but this does not, yet probably should:
>>
>>   printf '\xC7\xB0\xCC\xA3\n'|src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"
>>
>> Can you see a way to demonstrate a regression?
>
> Oh right, it doesn't handle these cases already.
> Fair enough I don't see a regression then.

This is also a good summary of stuff to consider with case:
http://www.unicode.org/faq/casemap_charprop.html

So picking another case situation from there:

  "in the Greek script, capital sigma (U+03A3) is the uppercase form of
  both the regular (U+03C2) and final (U+03C3) lowercase sigma."

One can see that sed handles this:

  $ printf '\u03C2\u03C3\n' | sed 's/.*/&\U&/'
  ςσΣΣ
  $ printf '\u03A3\n' | sed 's/.*/&\L&/'
  Σσ

Though I was surprised the grep (2.14) didn't match any combo of these

  $ printf '\u03C2\u03C3\n' | grep -Fi "$(printf \u03A3)"
  $ printf '\u03A3\n' | grep -Fi "$(printf \u03C2)"
  $ printf '\u03A3\n' | grep -Fi "$(printf \u03C3)"

Not a regression of course.

cheers,
Pádraig.
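
A note on the three grep commands in the message, offered as a sketch rather than a diagnosis: as written, `\u03A3` inside `$(printf ...)` is not quoted, so a POSIX shell removes the backslash and printf receives the plain string `u03A3`. The pattern handed to grep is then ASCII text rather than a sigma, which may be one reason nothing matched. The snippet below shows the quoting effect and builds the same characters from explicit UTF-8 octal escapes, which avoids relying on `\u` support in printf. The variable names are illustrative only, and whether grep 2.14 matches once the pattern is built correctly is not something this establishes.

```shell
# Unquoted backslash: the shell strips it, so printf sees the format "u03A3".
unquoted=$(printf \u03A3)
printf 'unquoted pattern: %s\n' "$unquoted"

# Portable alternative: spell out the UTF-8 bytes as octal escapes.
# U+03A3 is CE A3, U+03C3 is CF 83, U+03C2 is CF 82.
sigma_03a3=$(printf '\316\243')
sigma_03c3=$(printf '\317\203')
sigma_03c2=$(printf '\317\202')

# Each of these characters occupies two bytes in UTF-8.
cap_len=$(printf '%s' "$sigma_03a3" | wc -c | tr -d ' ')
printf 'bytes in U+03A3: %s\n' "$cap_len"
```

With these variables set, the experiment could be retried in a UTF-8 locale as, for example, `printf '%s\n' "$sigma_03a3" | grep -Fi "$sigma_03c3"`; the result will depend on the grep version and the locale in use.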

This bug report was last modified 11 years and 82 days ago.

