GNU bug report logs - #16481
dfa.c and Rational Range Interpretation


Package: grep

Reported by: Aharon Robbins <arnold <at> skeeve.com>

Date: Fri, 17 Jan 2014 13:41:01 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.



Message #29 received at 16481 <at> debbugs.gnu.org (full text, mbox):

From: Aharon Robbins <arnold <at> skeeve.com>
To: eggert <at> cs.ucla.edu, arnold <at> skeeve.com, 16481 <at> debbugs.gnu.org
Subject: Re: bug#16481: dfa.c and Rational Range Interpretation
Date: Sat, 25 Jan 2014 20:27:13 +0200

Hi Paul & Jim,

> > What happens if you compile them in and run the grep test suite?
>
> The test suite passes, but grep is bigger and (I presume) slower.  The 
> GREP-related changes are for performance, and shouldn't affect behavior.
>
> How about if we apply the attached patch to dfa.c, in both gawk and 
> grep?  I tried it just now, and gawk passed all its tests too.  Or, if 
> there's some reason this patch would introduce a bug into gawk, I'd like 
> to fix the grep test cases to detect the bug.

The code in question occurs in two functions, parse_bracket_exp() and atom().

The first instance is in parse_bracket_exp(), building a range expression,
where we may have multibyte characters.

      ....
      if (c1 == '-' && c2 != ']')
        {
          if (c2 == '\\' && (syntax_bits & RE_BACKSLASH_ESCAPE_IN_LISTS))
            FETCH_WC (c2, wc2, _("unbalanced ["));

          if (MB_CUR_MAX > 1)
            {
              /* When case folding map a range, say [m-z] (or even [M-z])
                 to the pair of ranges, [m-z] [M-Z].  */
              REALLOC_IF_NECESSARY (work_mbc->range_sts,
                                    range_sts_al, work_mbc->nranges + 1);
              REALLOC_IF_NECESSARY (work_mbc->range_ends,
                                    range_ends_al, work_mbc->nranges + 1);
              work_mbc->range_sts[work_mbc->nranges] =
                case_fold ? towlower (wc) : (wchar_t) wc;
              work_mbc->range_ends[work_mbc->nranges++] =
                case_fold ? towlower (wc2) : (wchar_t) wc2;

#ifndef GREP
              if (case_fold && (iswalpha (wc) || iswalpha (wc2)))
                {
                  REALLOC_IF_NECESSARY (work_mbc->range_sts,
                                        range_sts_al, work_mbc->nranges + 1);
                  work_mbc->range_sts[work_mbc->nranges] = towupper (wc);
                  REALLOC_IF_NECESSARY (work_mbc->range_ends,
                                        range_ends_al, work_mbc->nranges + 1);
                  work_mbc->range_ends[work_mbc->nranges++] = towupper (wc2);
                }
#endif
            }

To me this looks like when doing case folding (grep -i, IGNORECASE in gawk),
we turn the multibyte equivalent of [a-c] into [a-cA-C].  This would seem to
be necessary for correctness, so the question is: why does grep not need it?
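To make the range case concrete, here is a minimal standalone sketch of
what I believe the non-GREP branch does.  It is not the dfa.c code: the
struct wrange type and the fold_range() name are made up for illustration,
and it leaves out the REALLOC_IF_NECESSARY bookkeeping entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical stand-in for the range_sts/range_ends arrays in dfa.c.  */
struct wrange { wchar_t start, end; };

/* Store the range(s) needed to match [LO-HI] into OUT and return how
   many were stored.  When case folding, always store the lowercased
   range; if either endpoint is alphabetic, store the uppercased range
   as well, which is the part that sits under #ifndef GREP.  */
size_t
fold_range (wchar_t lo, wchar_t hi, int case_fold, struct wrange out[2])
{
  size_t n = 0;

  assert (out != NULL);
  out[n].start = case_fold ? (wchar_t) towlower ((wint_t) lo) : lo;
  out[n].end = case_fold ? (wchar_t) towlower ((wint_t) hi) : hi;
  n++;

  if (case_fold && (iswalpha ((wint_t) lo) || iswalpha ((wint_t) hi)))
    {
      out[n].start = (wchar_t) towupper ((wint_t) lo);
      out[n].end = (wchar_t) towupper ((wint_t) hi);
      n++;
    }
  return n;
}
```

With case folding on, fold_range (L'a', L'c', 1, r) should store two
ranges, a-c and A-C; with it off, only the range as written is stored.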

The next such bit is later on in the same function:

      if (case_fold && iswalpha (wc))
        {
          wc = towlower (wc);
          if (!setbit_wc (wc, ccl))
            {
              REALLOC_IF_NECESSARY (work_mbc->chars, chars_al,
                                    work_mbc->nchars + 1);
              work_mbc->chars[work_mbc->nchars++] = wc;
            }
#ifdef GREP
          continue;
#else
          wc = towupper (wc);
#endif
        }
      if (!setbit_wc (wc, ccl))
        {
          REALLOC_IF_NECESSARY (work_mbc->chars, chars_al,
                                work_mbc->nchars + 1);
          work_mbc->chars[work_mbc->nchars++] = wc;
        }
    }
  while ((wc = wc1, (c = c1) != ']'));

This too looks related to case folding and ranges; if I read it
correctly, when case folding it has already added the lowercase version
and now it has to add the uppercase version of the character.

Then, in atom():  (Why the bizarre leading `if (0)'?)

static void
atom (void)
{
  if (0)
    {
      /* empty */
    }
  else if (MBS_SUPPORT && tok == WCHAR)
    {
      addtok_wc (case_fold ? towlower (wctok) : wctok);
#ifndef GREP
      if (case_fold && iswalpha (wctok))
        {
          addtok_wc (towupper (wctok));
          addtok (OR);
        }
#endif

      tok = lex ();
    }

Here too, we're doing case folding, have added the lower case character
and need to add the upper case one.
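As a rough sketch of the token sequence this produces (postfix order:
lowercase, uppercase, then OR), assuming a simplified token buffer.  The
emit_wchar_tokens() name and the TOK_OR value are hypothetical; the real
addtok_wc() in dfa.c does more work than appending one value per character.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical marker; dfa.c has its own OR token.  */
enum { TOK_OR = -1 };

/* Append the postfix tokens for one wide character to OUT, which must
   have room for three entries, and return how many were written.  */
size_t
emit_wchar_tokens (wchar_t wc, int case_fold, long out[3])
{
  size_t n = 0;

  assert (out != NULL);
  out[n++] = case_fold ? (long) towlower ((wint_t) wc) : (long) wc;

  /* The part under #ifndef GREP: add the uppercase alternative.  */
  if (case_fold && iswalpha ((wint_t) wc))
    {
      out[n++] = (long) towupper ((wint_t) wc);
      out[n++] = TOK_OR;
    }
  return n;
}
```

Without the second branch, a case-folded pattern character would only
ever match its lowercase form, unless the input is folded somewhere else.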

I think to test out this code you'd need a character set where the lower
and upper case counterparts are multibyte characters, with grep -i in
effect.  But I suspect that grep has so much other code to special-case
grep -i that this code in dfa.c is never reached.
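One way to look for suitable test characters might be a small helper
like the one below.  This is only a sketch: the function names are
made up, and the answer depends on the locale selected with setlocale()
before calling it (in the default "C" locale nothing will qualify).

```c
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Return the number of bytes WC occupies in the current locale's
   encoding, or 0 if it cannot be encoded.  */
size_t
encoded_len (wchar_t wc)
{
  char buf[MB_LEN_MAX];
  mbstate_t st;
  size_t n;

  memset (&st, 0, sizeof st);
  n = wcrtomb (buf, wc, &st);
  return n == (size_t) -1 ? 0 : n;
}

/* Return nonzero if WC is alphabetic, has distinct lower and upper case
   forms, and both forms are multibyte in the current locale.  */
int
has_multibyte_case_pair (wchar_t wc)
{
  wchar_t lo = (wchar_t) towlower ((wint_t) wc);
  wchar_t up = (wchar_t) towupper ((wint_t) wc);

  return iswalpha ((wint_t) wc) && lo != up
         && encoded_len (lo) > 1 && encoded_len (up) > 1;
}

/* Sanity check: plain ASCII letters are single-byte, so never qualify.  */
void
self_check (void)
{
  assert (encoded_len (L'a') == 1);
  assert (!has_multibyte_case_pair (L'a'));
}
```

Characters for which this returns nonzero in a UTF-8 locale would be
candidates for a grep -i test pattern.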

In short, I don't think it's right to remove this code, but I don't
know how to test it to prove that, either.

HTH,

Arnold




This bug report was last modified 11 years and 132 days ago.


