#16481 - dfa.c and Rational Range Interpretation

Package: grep

Reported by: Aharon Robbins <arnold <at> skeeve.com>

Date: Fri, 17 Jan 2014 13:41:01 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Message #8 received at 16481 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Aharon Robbins <arnold <at> skeeve.com>, 16481 <at> debbugs.gnu.org
Subject: Re: bug#16481: dfa.c and Rational Range Interpretation
Date: Fri, 17 Jan 2014 14:43:29 -0800

Thanks for continuing to bird-dog this.

On 01/17/2014 05:39 AM, Aharon Robbins wrote:
> the following diff lets grep check the other awk syntax
> variants.  Feel free to apply it.

I did that (the first patch enclosed below).  Thanks.

> I do think that gawk's code is the correct thing to be doing for RRI.

I agree, and installed the second patch enclosed below to implement
this.  This patch also includes some documentation changes -- if you
have a bit of time to review them I'd appreciate it.

Also, I notice that there are a few "#ifdef GREP"s in dfa.c  Do you
happen to know why they're needed?  It'd be nice if we could simplify
dfa.c to omit the need for the GREP macro.

> Additionally, I recommend that grep's configure check for good RRI
> support in the system regex routines and switch to the included ones
> if the system ones don't support it.

Unfortunately that'd break support for equivalence classes and
multibyte collation symbols on GNU/Linux platforms, so it may be a
bridge too far.  Until we get glibc fixed, I think it's OK to live with
the situation where [a-z] ordinarily has the rational range
interpretation, and this breaks down only for complicated matches where
the DFA doesn't suffice; at least it'll work in the usual case.

From c862ced6f31f0ccdf2505ac46e354a1a011149cd Mon Sep 17 00:00:00 2001
From: Aharon Robbins <arnold <at> skeeve.com>
Date: Fri, 17 Jan 2014 12:42:49 -0800
Subject: [PATCH 1/2] grep: add undocumented '-X gawk' and '-X posixawk' options

See <http://bugs.gnu.org/16481>.
* src/grep.c (GAcompile, PAcompile): New functions.
(const): Use them.
---
 src/grep.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/src/grep.c b/src/grep.c
index 1b2198f..12644a2 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -19,10 +19,24 @@ Acompile (char const *pattern, size_t size)
   GEAcompile (pattern, size, RE_SYNTAX_AWK);
 }
 
+static void
+GAcompile (char const *pattern, size_t size)
+{
+  GEAcompile (pattern, size, RE_SYNTAX_GNU_AWK);
+}
+
+static void
+PAcompile (char const *pattern, size_t size)
+{
+  GEAcompile (pattern, size, RE_SYNTAX_POSIX_AWK);
+}
+
 struct matcher const matchers[] = {
   { "grep", Gcompile, EGexecute },
   { "egrep", Ecompile, EGexecute },
   { "awk", Acompile, EGexecute },
+  { "gawk", GAcompile, EGexecute },
+  { "posixawk", PAcompile, EGexecute },
   { "fgrep", Fcompile, Fexecute },
   { "perl", Pcompile, Pexecute },
   { NULL, NULL, NULL },
-- 
1.8.4.2

From aba2c718908d6c8fcfd75d55a43a4c9b1e3405a3 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert <at> cs.ucla.edu>
Date: Fri, 17 Jan 2014 14:32:10 -0800
Subject: [PATCH 2/2] grep: DFA now uses rational ranges in unibyte locales

Problem reported by Aharon Robbins in <http://bugs.gnu.org/16481>.
* NEWS:
* doc/grep.texi (Environment Variables)
(Character Classes and Bracket Expressions): Document this.
* src/dfa.c (parse_bracket_exp): Treat unibyte locales like multibyte.
---
 NEWS          |  8 ++++++++
 doc/grep.texi | 19 +++++++++----------
 src/dfa.c     | 20 ++------------------
 3 files changed, 19 insertions(+), 28 deletions(-)

diff --git a/NEWS b/NEWS
index 6e46684..589b2ac 100644
--- a/NEWS
+++ b/NEWS
@@ -7,6 +7,14 @@ GNU grep NEWS -*- outline -*-
   grep -i in a multibyte locale is now typically 10 times faster
   for patterns that do not contain \ or [.
 
+  Range expressions in unibyte locales now ordinarily use the rational
+  range interpretation, in which [a-z] matches only lower-case ASCII
+  letters regardless of locale, and similarly for other ranges.  (This
+  was already true for multibyte locales.)  Portable programs should
+  continue to specify the C locale when using range expressions, since
+  these expressions have unspecified behavior in non-GNU systems and
+  are not yet guaranteed to use the rational range interpretation even
+  in GNU systems.
 
 * Noteworthy changes in release 2.16 (2014-01-01) [stable]
 
diff --git a/doc/grep.texi b/doc/grep.texi
index 473a181..42fb9a2 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -960,8 +960,8 @@ They are omitted (i.e., false) by default and become true when specified.
 @cindex national language support
 @cindex NLS
 These variables specify the locale for the @code{LC_COLLATE} category,
-which determines the collating sequence
-used to interpret range expressions like @samp{[a-z]}.
+which might affect how range expressions like @samp{[a-z]} are
+interpreted.
 
 @item LC_ALL
 @itemx LC_CTYPE
@@ -1223,14 +1223,13 @@ For example, the regular expression
 Within a bracket expression, a @dfn{range expression} consists of two
 characters separated by a hyphen.
 It matches any single character that
-sorts between the two characters, inclusive, using the locale's
-collating sequence and character set.
-For example, in the default C
-locale, @samp{[a-d]} is equivalent to @samp{[abcd]}.
-Many locales sort
-characters in dictionary order, and in these locales @samp{[a-d]} is
-typically not equivalent to @samp{[abcd]};
-it might be equivalent to @samp{[aBbCcDd]}, for example.
+sorts between the two characters, inclusive.
+In the default C locale, the sorting sequence is the native character
+order; for example, @samp{[a-d]} is equivalent to @samp{[abcd]}.
+In other locales, the sorting sequence is not specified, and
+@samp{[a-d]} might be equivalent to @samp{[abcd]} or to
+@samp{[aBbCcDd]}, or it might fail to match any character, or the set of
+characters that it matches might even be erratic.
 To obtain the traditional interpretation
 of bracket expressions, you can use the @samp{C} locale by setting the
 @env{LC_ALL} environment variable to the value @samp{C}.
diff --git a/src/dfa.c b/src/dfa.c
index 6ab4e05..5e3140d 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -1108,30 +1108,14 @@ parse_bracket_exp (void)
                     }
                   else
                     {
-                      /* Defer to the system regex library about the meaning
-                         of range expressions.  */
-                      regex_t re;
-                      char pattern[6] = { '[', 0, '-', 0, ']', 0 };
-                      char subject[2] = { 0, 0 };
                       c1 = c;
                       if (case_fold)
                         {
                           c1 = tolower (c1);
                           c2 = tolower (c2);
                         }
-
-                      pattern[1] = c1;
-                      pattern[3] = c2;
-                      regcomp (&re, pattern, REG_NOSUB);
-                      for (c = 0; c < NOTCHAR; ++c)
-                        {
-                          if ((case_fold && isupper (c)))
-                            continue;
-                          subject[0] = c;
-                          if (regexec (&re, subject, 0, NULL, 0) != REG_NOMATCH)
-                            setbit_case_fold_c (c, ccl);
-                        }
-                      regfree (&re);
+                      for (c = c1; c <= c2; c++)
+                        setbit_case_fold_c (c, ccl);
                     }
 
                   colon_warning_state |= 8;
-- 
1.8.4.2
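
For readers who want to see the rational range interpretation in isolation, here is a minimal standalone sketch of what the new loop in parse_bracket_exp does for a unibyte range. It is not the dfa.c code: the charset type and the charset_add and charset_add_range helpers are hypothetical stand-ins for dfa.c's character-class bitmap and setbit_case_fold_c, and the case folding shown is only an approximation of what grep does.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for dfa.c's character-class bitmap:
   one flag per unsigned char value.  */
enum { NCHARS = UCHAR_MAX + 1 };
typedef struct { bool member[NCHARS]; } charset;

/* Add C to SET.  When FOLD is true, also add C's lower-case and
   upper-case counterparts (an assumption about how folding works,
   made only for this sketch).  */
static void
charset_add (charset *set, int c, bool fold)
{
  assert (set != NULL);
  set->member[c] = true;
  if (fold)
    {
      set->member[tolower (c)] = true;
      set->member[toupper (c)] = true;
    }
}

/* Rational range interpretation of [LO-HI]: add every byte whose code
   lies between LO and HI inclusive.  LC_COLLATE is never consulted.
   As in the patch, both endpoints are lower-cased first when folding.  */
static void
charset_add_range (charset *set, unsigned char lo, unsigned char hi,
                   bool fold)
{
  if (fold)
    {
      lo = tolower (lo);
      hi = tolower (hi);
    }
  /* Use int for the counter so that HI == UCHAR_MAX cannot wrap.  */
  for (int c = lo; c <= hi; c++)
    charset_add (set, c, fold);
}
```

Calling charset_add_range (&s, 'a', 'd', false) on a zeroed set marks exactly a, b, c and d, whatever the locale's collating sequence says. The removed code instead compiled "[a-d]" with regcomp and probed every byte with regexec, so the result depended on the system regex library.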

This bug report was last modified 11 years and 132 days ago.

