GNU bug report logs - #16895
[PATCH] grep: fix multiple bugs with bracket expressions


Package: grep;

Reported by: Paul Eggert <eggert <at> cs.ucla.edu>

Date: Thu, 27 Feb 2014 17:35:01 UTC

Severity: normal

Tags: fixed, patch

Fixed in versions 16232, 16777

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.




From: Aharon Robbins <arnold <at> skeeve.com>
To: eggert <at> cs.ucla.edu, 16895 <at> debbugs.gnu.org
Subject: bug#16895: [PATCH] grep: fix multiple bugs with bracket expressions
Date: Thu, 27 Feb 2014 22:31:14 +0200
Hi Paul.

> Subject: bug#16895: [PATCH] grep: fix multiple bugs with bracket expressions
> To: 16895 <at> debbugs.gnu.org
> Date: Thu, 27 Feb 2014 09:34:33 -0800
> From: Paul Eggert <eggert <at> cs.ucla.edu>
>
> I'm afraid there are several problems in the dfa code.  I still don't 
> have a handle on all of them, but here's my first patch to deal with the 
> first major one I found.  Patterns like [a-[.z.]], which caused 'grep' 
> to dump core until recently, still aren't being handled correctly, and 
> there are several closely related bugs here.  I've taken the liberty of 
> pushing the attached patch.

Thanks. This looks promising. A few comments / questions.

> +/* Return true if the current locale is known to be a unibyte locale
> +   without multicharacter collating sequences and where range
> +   comparisons simply use the native encoding.  These locales can be
> +   processed more efficiently.  */
> +
> +static bool
> +using_simple_locale (void)
> +{
> +  /* True if the native character set is known to be compatible with
> +     the C locale.  The following test isn't perfect, but it's good
> +     enough in practice, as only ASCII and EBCDIC are in common use
> +     and this test correctly accepts ASCII and rejects EBCDIC.  */
> +  enum { native_c_charset =
> +    ('\b' == 8 && '\t' == 9 && '\n' == 10 && '\v' == 11 && '\f' == 12
> +     && '\r' == 13 && ' ' == 32 && '!' == 33 && '"' == 34 && '#' == 35
> +     && '%' == 37 && '&' == 38 && '\'' == 39 && '(' == 40 && ')' == 41
> +     && '*' == 42 && '+' == 43 && ',' == 44 && '-' == 45 && '.' == 46
> +     && '/' == 47 && '0' == 48 && '9' == 57 && ':' == 58 && ';' == 59
> +     && '<' == 60 && '=' == 61 && '>' == 62 && '?' == 63 && 'A' == 65
> +     && 'Z' == 90 && '[' == 91 && '\\' == 92 && ']' == 93 && '^' == 94
> +     && '_' == 95 && 'a' == 97 && 'z' == 122 && '{' == 123 && '|' == 124
> +     && '}' == 125 && '~' == 126)
> +  };

What a mouthful!  Is all that really necessary?
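
For reference, the quoted test works because every operand is a character constant, so the compiler can fold the whole conjunction into a single 0 or 1 at compile time. The sketch below is a shortened, hypothetical variant and not the code from the patch: the names `sample_ascii_charset` and `charset_looks_like_ascii` are made up for illustration, and it samples far fewer code points than the original. EBCDIC encodes 'a' as 0x81, so even one comparison would reject it; the longer list in the patch presumably guards against other, less common character sets.

```c
#include <stdbool.h>

/* Hypothetical, shortened variant of the charset test.  It checks only
   the endpoints of the digit and letter ranges plus the two brackets.
   Every operand is a character constant, so the expression is an
   integer constant expression evaluated by the compiler.  */
enum
{
  sample_ascii_charset =
    ('0' == 48 && '9' == 57 && 'A' == 65 && 'Z' == 90
     && 'a' == 97 && 'z' == 122 && '[' == 91 && ']' == 93)
};

/* Hypothetical helper: true if the sampled code points match ASCII.  */
static bool
charset_looks_like_ascii (void)
{
  return sample_ascii_charset;
}
```

Because the result is an enum constant, a caller can branch on it with no run-time comparisons; the trade-off is that a shorter list accepts more non-ASCII character sets by accident.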

> +          if ((c1 == ':' && syntax_bits & RE_CHAR_CLASSES)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I'd suggest parentheses around the bit with the bitwise operator,
both for readability and to match the rest of the code.
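
To make the point concrete: in C, bitwise `&` binds more tightly than logical `&&`, so the unparenthesized form already parses the way the suggested form does, and the parentheses change only readability. The sketch below shows the two spellings side by side. The flag value `SKETCH_CHAR_CLASSES` and both function names are hypothetical stand-ins; the real RE_CHAR_CLASSES comes from regex.h.

```c
#include <stdbool.h>

/* Hypothetical flag value, for illustration only; the real
   RE_CHAR_CLASSES is defined in regex.h.  */
#define SKETCH_CHAR_CLASSES 0x4UL

/* Spelled as in the patch: '&' has higher precedence than '&&', so
   this parses as  c1 == ':' && (syntax_bits & SKETCH_CHAR_CLASSES).  */
static bool
class_allowed_unparenthesized (int c1, unsigned long syntax_bits)
{
  return c1 == ':' && syntax_bits & SKETCH_CHAR_CLASSES;
}

/* Spelled with the suggested parentheses: same result, but the
   grouping is visible without recalling the precedence table.  */
static bool
class_allowed_parenthesized (int c1, unsigned long syntax_bits)
{
  return c1 == ':' && (syntax_bits & SKETCH_CHAR_CLASSES);
}
```

Both functions return the same value for every input, so the change is purely stylistic.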

> @@ -1000,7 +1043,10 @@ parse_bracket_exp (void)
>                /* Fetch bracket.  */
>                FETCH_WC (c, wc, _("unbalanced ["));
>                if (c1 == ':')
> -                /* build character class.  */
> +                /* Build character class.  POSIX allows character
> +                   classes to match multicharacter collating elements,
> +                   but the regex code does not support that, so do not
> +                   worry about that possibility.  */

I thought GLIBC did support them?

I will try this out in gawk sometime in the next few days and
let you know how it goes.

Thanks for the work!

Arnold




This bug report was last modified 11 years and 84 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.