GNU bug report logs -
#16895
[PATCH] grep: fix multiple bugs with bracket expressions
Reported by: Paul Eggert <eggert <at> cs.ucla.edu>
Date: Thu, 27 Feb 2014 17:35:01 UTC
Severity: normal
Tags: fixed, patch
Fixed in versions 16232, 16777
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
Hi Paul.
> Subject: bug#16895: [PATCH] grep: fix multiple bugs with bracket expressions
> To: 16895 <at> debbugs.gnu.org
> Date: Thu, 27 Feb 2014 09:34:33 -0800
> From: Paul Eggert <eggert <at> cs.ucla.edu>
>
> I'm afraid there are several problems in the dfa code. I still don't
> have a handle on all of them, but here's my first patch to deal with the
> first major one I found. Patterns like [a-[.z.]], which caused 'grep'
> to dump core until recently, still aren't being handled correctly, and
> there are several closely related bugs here. I've taken the liberty of
> pushing the attached patch.
Thanks. This looks promising. A few comments / questions.
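For anyone who wants to poke at the [a-[.z.]] case mentioned above outside of grep, here is a minimal sketch using the POSIX regcomp/regexec interface. The helper name probe_bracket is made up for illustration and is not part of the patch. It exercises the C library's matcher rather than grep's dfa code, so treat it only as a point of comparison; the result may differ between C libraries and locales.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper, not from the patch: compile PATTERN as a POSIX
   basic regular expression and try it against TEXT.
   Returns -1 if the pattern does not compile, 1 on a match, 0 otherwise.  */
static int
probe_bracket (const char *pattern, const char *text)
{
  regex_t re;
  int rc;

  /* REG_NOSUB: only a yes/no answer is needed, not match offsets.  */
  if (regcomp (&re, pattern, REG_NOSUB) != 0)
    return -1;

  rc = regexec (&re, text, 0, NULL, 0);
  regfree (&re);
  return rc == 0 ? 1 : 0;
}
```

Calling probe_bracket ("[a-[.z.]]", "m") and probe_bracket ("[a-z]", "m") side by side shows whether the local C library treats the two spellings of the range the same way.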
> +/* Return true if the current locale is known to be a unibyte locale
> + without multicharacter collating sequences and where range
> + comparisons simply use the native encoding. These locales can be
> + processed more efficiently. */
> +
> +static bool
> +using_simple_locale (void)
> +{
> + /* True if the native character set is known to be compatible with
> + the C locale. The following test isn't perfect, but it's good
> + enough in practice, as only ASCII and EBCDIC are in common use
> + and this test correctly accepts ASCII and rejects EBCDIC. */
> + enum { native_c_charset =
> + ('\b' == 8 && '\t' == 9 && '\n' == 10 && '\v' == 11 && '\f' == 12
> + && '\r' == 13 && ' ' == 32 && '!' == 33 && '"' == 34 && '#' == 35
> + && '%' == 37 && '&' == 38 && '\'' == 39 && '(' == 40 && ')' == 41
> + && '*' == 42 && '+' == 43 && ',' == 44 && '-' == 45 && '.' == 46
> + && '/' == 47 && '0' == 48 && '9' == 57 && ':' == 58 && ';' == 59
> + && '<' == 60 && '=' == 61 && '>' == 62 && '?' == 63 && 'A' == 65
> + && 'Z' == 90 && '[' == 91 && '\\' == 92 && ']' == 93 && '^' == 94
> + && '_' == 95 && 'a' == 97 && 'z' == 122 && '{' == 123 && '|' == 124
> + && '}' == 125 && '~' == 126)
> + };
What a mouthful! Is all that really necessary?
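For context on what the enum buys: an enumerator's value must be an integer constant expression, so the whole chain of comparisons is folded by the compiler and costs nothing at run time. A minimal sketch with a much shorter, hypothetical subset of the comparisons (the names ascii_like_charset and charset_is_ascii_like are mine, not from the patch):

```c
#include <stdbool.h>

/* Shortened, illustrative version of the patch's test.  The enumerator
   is an integer constant expression, so it is evaluated at compile time
   and ends up as 0 or 1.  */
enum { ascii_like_charset =
         ('\t' == 9 && '\n' == 10 && ' ' == 32 && '0' == 48 && '9' == 57
          && 'A' == 65 && 'Z' == 90 && 'a' == 97 && 'z' == 122 && '~' == 126)
};

/* Hypothetical helper, not part of the patch: expose the constant.  */
static bool
charset_is_ascii_like (void)
{
  return ascii_like_charset != 0;
}
```

Fewer comparisons make for a weaker check; how many are actually needed is exactly the open question here.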
> + if ((c1 == ':' && syntax_bits & RE_CHAR_CLASSES)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I'd suggest parentheses around the bit with the bitwise operator,
both for readability and to match the rest of the code.
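To be clear, this is about style and not correctness: in C, '&' binds more tightly than '&&', so the unparenthesized form already parses the intended way. A small sketch showing that the two spellings agree. DEMO_CHAR_CLASSES is a stand-in value chosen for illustration, not the real RE_CHAR_CLASSES from regex.h, and the function names are hypothetical:

```c
#include <stdbool.h>

/* Stand-in flag for illustration only; the real RE_CHAR_CLASSES is
   defined in regex.h and may have a different value.  */
#define DEMO_CHAR_CLASSES 0x4UL

/* As written in the patch: parsed as c1 == ':' && (bits & flag),
   because '&' has higher precedence than '&&'.  */
static bool
class_start_unparenthesized (int c1, unsigned long bits)
{
  return c1 == ':' && bits & DEMO_CHAR_CLASSES;
}

/* With the suggested parentheses: same result, clearer intent.  */
static bool
class_start_parenthesized (int c1, unsigned long bits)
{
  return c1 == ':' && (bits & DEMO_CHAR_CLASSES);
}
```

The parentheses matter more when '&' sits next to '==' or '!=', which bind more tightly than '&', so writing them consistently avoids having to think about which case applies.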
> @@ -1000,7 +1043,10 @@ parse_bracket_exp (void)
> /* Fetch bracket. */
> FETCH_WC (c, wc, _("unbalanced ["));
> if (c1 == ':')
> - /* build character class. */
> + /* Build character class. POSIX allows character
> + classes to match multicharacter collating elements,
> + but the regex code does not support that, so do not
> + worry about that possibility. */
I thought GLIBC did support them?
I will try this out in gawk sometime in the next few days and
let you know how it goes.
Thanks for the work!
Arnold