#16481 - dfa.c and Rational Range Interpretation

GNU bug report logs - #16481
dfa.c and Rational Range Interpretation

Package: grep;

Reported by: Aharon Robbins <arnold <at> skeeve.com>

Date: Fri, 17 Jan 2014 13:41:01 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Message #29 received at 16481 <at> debbugs.gnu.org (full text, mbox):

From: Aharon Robbins <arnold <at> skeeve.com> To: eggert <at> cs.ucla.edu, arnold <at> skeeve.com, 16481 <at> debbugs.gnu.org Subject: Re: bug#16481: dfa.c and Rational Range Interpretation Date: Sat, 25 Jan 2014 20:27:13 +0200

Hi Paul & Jim, > > What happens if you compile them in and run the grep test suite? > > The test suite passes, but grep is bigger and (I presume) slower. The > GREP-related changes are for performance, and shouldn't affect behavior. > > How about if we apply the attached patch to dfa.c, in both gawk and > grep? I tried it just now, and gawk passed all its tests too. Or, if > there's some reason this patch would introduce a bug into gawk, I'd like > to fix the grep test cases to detect the bug. The code in question occurs in two functions, parse_bracket_exp() and atom(). The first instance is in parse_bracket_exp(), building a range expression, where we may have multibyte characters. .... if (c1 == '-' && c2 != ']') { if (c2 == '\\' && (syntax_bits & RE_BACKSLASH_ESCAPE_IN_LISTS)) FETCH_WC (c2, wc2, _("unbalanced [")); if (MB_CUR_MAX > 1) { /* When case folding map a range, say [m-z] (or even [M-z]) to the pair of ranges, [m-z] [M-Z]. */ REALLOC_IF_NECESSARY (work_mbc->range_sts, range_sts_al, work_mbc->nranges + 1); REALLOC_IF_NECESSARY (work_mbc->range_ends, range_ends_al, work_mbc->nranges + 1); work_mbc->range_sts[work_mbc->nranges] = case_fold ? towlower (wc) : (wchar_t) wc; work_mbc->range_ends[work_mbc->nranges++] = case_fold ? towlower (wc2) : (wchar_t) wc2; #ifndef GREP if (case_fold && (iswalpha (wc) || iswalpha (wc2))) { REALLOC_IF_NECESSARY (work_mbc->range_sts, range_sts_al, work_mbc->nranges + 1); work_mbc->range_sts[work_mbc->nranges] = towupper (wc); REALLOC_IF_NECESSARY (work_mbc->range_ends, range_ends_al, work_mbc->nranges + 1); work_mbc->range_ends[work_mbc->nranges++] = towupper (wc2); } #endif } To me this looks like when doing case folding (grep -i, IGNORECASE in gawk), we turn the m.b. equivalent of [a-c] into [a-cA-C]. This would seem to be necessary for correctness, and the question is why does grep not need it? The next such bit is later on in the same function: if (case_fold && iswalpha (wc)) { wc = towlower (wc); if (!setbit_wc (wc, ccl)) { REALLOC_IF_NECESSARY (work_mbc->chars, chars_al, work_mbc->nchars + 1); work_mbc->chars[work_mbc->nchars++] = wc; } #ifdef GREP continue; #else wc = towupper (wc); #endif } if (!setbit_wc (wc, ccl)) { REALLOC_IF_NECESSARY (work_mbc->chars, chars_al, work_mbc->nchars + 1); work_mbc->chars[work_mbc->nchars++] = wc; } } while ((wc = wc1, (c = c1) != ']')); This too looks related to case folding and ranges; if I read it correctly, when case folding it added the lower case version and now it has to add the uppercase version of the charcter. Then, in atom(): (Why the bizarre leading `if (0)'?) static void atom (void) { if (0) { /* empty */ } else if (MBS_SUPPORT && tok == WCHAR) { addtok_wc (case_fold ? towlower (wctok) : wctok); #ifndef GREP if (case_fold && iswalpha (wctok)) { addtok_wc (towupper (wctok)); addtok (OR); } #endif tok = lex (); } Here too, we're doing case folding, have added the lower case character and need to add the upper case one. I think to test out this code you'd need a character set where the lower and upper case counterparts are multibyte characters and grep -i is in effect. But I suspect that grep has so much other code to special case grep -i that this code in dfa.c is never reached. In short, I don't think it's right to remove this code, but I don't know how to test it to prove that, either. HTH, Arnold

This bug report was last modified 11 years and 132 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #16481 dfa.c and Rational Range Interpretation

GNU bug report logs - #16481
dfa.c and Rational Range Interpretation