#16481 - dfa.c and Rational Range Interpretation

GNU bug report logs - #16481
dfa.c and Rational Range Interpretation

Package: grep;

Reported by: Aharon Robbins <arnold <at> skeeve.com>

Date: Fri, 17 Jan 2014 13:41:01 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

View this message in rfc822 format

From: help-debbugs <at> gnu.org (GNU bug Tracking System) To: Aharon Robbins <arnold <at> skeeve.com> Subject: bug#16481: closed (Re: dfa.c and Rational Range Interpretation) Date: Sun, 09 Mar 2014 20:08:02 +0000

[Message part 1 (text/plain, inline)]

Your bug report #16481: dfa.c and Rational Range Interpretation which was filed against the grep package, has been closed. The explanation is attached below, along with your original report. If you require more details, please reply to 16481 <at> debbugs.gnu.org. -- 16481: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=16481 GNU Bug Tracking System Contact help-debbugs <at> gnu.org with problems

[Message part 2 (message/rfc822, inline)]

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: 16481-done <at> debbugs.gnu.org
Subject: Re:  dfa.c and Rational Range Interpretation
Date: Sun, 09 Mar 2014 13:07:12 -0700

It seems that the issues in this bug report are all done in the savannah 
git master for grep, so I'm marking this as done.

At some point I'd like to change the regex code to support RRI, at which 
point the dfa.c should now automatically adapt without our having to 
change dfa.c further.  But that would be a matter for a gnulib and/or 
glibc bug report, not dfa and/or grep.

[Message part 3 (message/rfc822, inline)]

From: Aharon Robbins <arnold <at> skeeve.com>
To: bug-grep <at> gnu.org
Subject: dfa.c and Rational Range Interpretation
Date: Fri, 17 Jan 2014 15:39:48 +0200

Hello All.

I believe that the code in dfa.c that deals with character ranges
is incorrect with respect to Rational Range Interpretation.
This shows up in the following test case:

	$ echo \\ | src/grep -Xawk '[\[-\]]'
	$ 

Whereas with gawk:

	$ echo \\ | gawk '/[\[-\]]/'
	\

From ascii(7):

      ...       133   91    5B    [
      ...       134   92    5C    \  '\\'
      ...       135   93    5D    ]

So gawk is correct here.  (This is on a GLIBC system; in private email
Jim reported different behavior on Mac OS X.)

In the grep master, the code in question is in dfa.c:parse_bracket_exp,
lines 1110 - 1135:

            {
              /* Defer to the system regex library about the meaning
                 of range expressions.  */
              regex_t re;
              char pattern[6] = { '[', 0, '-', 0, ']', 0 };
              char subject[2] = { 0, 0 };
              c1 = c;
              if (case_fold)
                {
                  c1 = tolower (c1);
                  c2 = tolower (c2);
                }

              pattern[1] = c1;
              pattern[3] = c2;
              regcomp (&re, pattern, REG_NOSUB);
              for (c = 0; c < NOTCHAR; ++c)
                {
                  if ((case_fold && isupper (c)))
                    continue;
                  subject[0] = c;
                  if (regexec (&re, subject, 0, NULL, 0) != REG_NOMATCH)
                    setbit_case_fold_c (c, ccl);
                }
              regfree (&re);
            }

This code lets the regex routines decide what characters match
a particular range expression. If the regex routines are not
obeying RRI, then dfa.c will not either.  Yet, grep now supports RRI.

(To me this argues that grep's configure should be checking the
system regex routines for correct RRI support, and automatically
using the included routines if the system routines are not good. Gawk
goes further and simply always uses the included regex routines,
*guaranteeing* consistent behavior across systems. But that's a
parenthetical issue.)

In addition, the call to regcomp could fail, but this isn't being
checked. When I add an error report to the call, I get the following
on one of the gawk test cases:

  "[.c.]" ~ /[a-[.e.]]/ --> 1
  dfa.c:1176: regcomp(/[a-[]/) failed: Invalid range end

Since this relates to [. and .] which dfa and regex don't really
support, there's a gap somewhere, but the point is that if regcomp
fails, nobody notices. What does regexec do if regcomp fails?
Beats me...

Next, let's take a harder look at this:

              for (c = 0; c < NOTCHAR; ++c)
                {
                  if ((case_fold && isupper (c)))
                    continue;
                  subject[0] = c;
                  if (regexec (&re, subject, 0, NULL, 0) != REG_NOMATCH)
                    setbit_case_fold_c (c, ccl);
                }

Since c is 0 on the first iteration, regexec is called with subject
equal to [ '\0' '\0' ].  The first thing regexec does is

	length = strlen(string);

which in this case will be zero. We really want a length of 1 where
the first byte is zero (no arbitrary limits, eh?).  Bug in the regexec
interface, methinks, but in any case, testing 0 is fruitless.

However, this code begs a deeper question. If we're doing RRI, then by
definition only the values between the low member of the range and
the high member of the range can match the range expression.  So why
loop over everything from 0 to 255?

Thus, gawk replaces the above code with the following:

              c1 = c;
              if (case_fold)
                {
                  c1 = tolower (c1);
                  c2 = tolower (c2);
                }
              for (c = c1; c <= c2; c++)
                setbit_case_fold_c (c, ccl);

This sets the bits for exactly those characters in the range. No more,
no less. And it doesn't rely on the system regex routines, which makes
compiling the dfa go faster.

Grep only compiles its dfa once, but gawk can compile arbitrarily many
dfa's, since it can match expressions that are computed dynamically.

I'm not sure if this analysis covers all the problems with the current
code.  But I do think that gawk's code is the correct thing to be
doing for RRI.

Additionally, I recommend that grep's configure check for good RRI
support in the system regex routines and switch to the included ones
if the system ones don't support it.

Finally, the following diff lets grep check the other awk syntax
variants.  Feel free to apply it. For the above test case, all three
give the same results.

I hope all this is of interest.

Thanks!

Arnold
-----------------------------------------------------
diff --git a/src/grep.c b/src/grep.c
index 1b2198f..12644a2 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -19,10 +19,24 @@ Acompile (char const *pattern, size_t size)
   GEAcompile (pattern, size, RE_SYNTAX_AWK);
 }
 
+static void
+GAcompile (char const *pattern, size_t size)
+{
+  GEAcompile (pattern, size, RE_SYNTAX_GNU_AWK);
+}
+
+static void
+PAcompile (char const *pattern, size_t size)
+{
+  GEAcompile (pattern, size, RE_SYNTAX_POSIX_AWK);
+}
+
 struct matcher const matchers[] = {
   { "grep",    Gcompile, EGexecute },
   { "egrep",   Ecompile, EGexecute },
   { "awk",     Acompile, EGexecute },
+  { "gawk",    GAcompile, EGexecute },
+  { "posixawk", PAcompile, EGexecute },
   { "fgrep",   Fcompile, Fexecute },
   { "perl",    Pcompile, Pexecute },
   { NULL, NULL, NULL },

This bug report was last modified 11 years and 134 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #16481 dfa.c and Rational Range Interpretation

GNU bug report logs - #16481
dfa.c and Rational Range Interpretation