GNU bug report logs -
#16481
dfa.c and Rational Range Interpretation
Previous Next
Reported by: Aharon Robbins <arnold <at> skeeve.com>
Date: Fri, 17 Jan 2014 13:41:01 UTC
Severity: normal
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
Full log
Message #29 received at 16481 <at> debbugs.gnu.org (full text, mbox):
Hi Paul & Jim,
> > What happens if you compile them in and run the grep test suite?
>
> The test suite passes, but grep is bigger and (I presume) slower. The
> GREP-related changes are for performance, and shouldn't affect behavior.
>
> How about if we apply the attached patch to dfa.c, in both gawk and
> grep? I tried it just now, and gawk passed all its tests too. Or, if
> there's some reason this patch would introduce a bug into gawk, I'd like
> to fix the grep test cases to detect the bug.
The code in question occurs in two functions, parse_bracket_exp() and atom().
The first instance is in parse_bracket_exp(), building a range expression,
where we may have multibyte characters.
....
if (c1 == '-' && c2 != ']')
{
if (c2 == '\\' && (syntax_bits & RE_BACKSLASH_ESCAPE_IN_LISTS))
FETCH_WC (c2, wc2, _("unbalanced ["));
if (MB_CUR_MAX > 1)
{
/* When case folding map a range, say [m-z] (or even [M-z])
to the pair of ranges, [m-z] [M-Z]. */
REALLOC_IF_NECESSARY (work_mbc->range_sts,
range_sts_al, work_mbc->nranges + 1);
REALLOC_IF_NECESSARY (work_mbc->range_ends,
range_ends_al, work_mbc->nranges + 1);
work_mbc->range_sts[work_mbc->nranges] =
case_fold ? towlower (wc) : (wchar_t) wc;
work_mbc->range_ends[work_mbc->nranges++] =
case_fold ? towlower (wc2) : (wchar_t) wc2;
#ifndef GREP
if (case_fold && (iswalpha (wc) || iswalpha (wc2)))
{
REALLOC_IF_NECESSARY (work_mbc->range_sts,
range_sts_al, work_mbc->nranges + 1);
work_mbc->range_sts[work_mbc->nranges] = towupper (wc);
REALLOC_IF_NECESSARY (work_mbc->range_ends,
range_ends_al, work_mbc->nranges + 1);
work_mbc->range_ends[work_mbc->nranges++] = towupper (wc2);
}
#endif
}
To me this looks like when doing case folding (grep -i, IGNORECASE in gawk),
we turn the m.b. equivalent of [a-c] into [a-cA-C]. This would seem to be
necessary for correctness, and the question is why does grep not need it?
The next such bit is later on in the same function:
if (case_fold && iswalpha (wc))
{
wc = towlower (wc);
if (!setbit_wc (wc, ccl))
{
REALLOC_IF_NECESSARY (work_mbc->chars, chars_al,
work_mbc->nchars + 1);
work_mbc->chars[work_mbc->nchars++] = wc;
}
#ifdef GREP
continue;
#else
wc = towupper (wc);
#endif
}
if (!setbit_wc (wc, ccl))
{
REALLOC_IF_NECESSARY (work_mbc->chars, chars_al,
work_mbc->nchars + 1);
work_mbc->chars[work_mbc->nchars++] = wc;
}
}
while ((wc = wc1, (c = c1) != ']'));
This too looks related to case folding and ranges; if I read it
correctly, when case folding it added the lower case version and
now it has to add the uppercase version of the charcter.
Then, in atom(): (Why the bizarre leading `if (0)'?)
static void
atom (void)
{
if (0)
{
/* empty */
}
else if (MBS_SUPPORT && tok == WCHAR)
{
addtok_wc (case_fold ? towlower (wctok) : wctok);
#ifndef GREP
if (case_fold && iswalpha (wctok))
{
addtok_wc (towupper (wctok));
addtok (OR);
}
#endif
tok = lex ();
}
Here too, we're doing case folding, have added the lower case character
and need to add the upper case one.
I think to test out this code you'd need a character set where the lower
and upper case counterparts are multibyte characters and grep -i is
in effect. But I suspect that grep has so much other code to special
case grep -i that this code in dfa.c is never reached.
In short, I don't think it's right to remove this code, but I don't
know how to test it to prove that, either.
HTH,
Arnold
This bug report was last modified 11 years and 132 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.