GNU bug report logs - #16895
[PATCH] grep: fix multiple bugs with bracket expressions

Previous Next

Package: grep;

Reported by: Paul Eggert <eggert <at> cs.ucla.edu>

Date: Thu, 27 Feb 2014 17:35:01 UTC

Severity: normal

Tags: fixed, patch

Fixed in versions 16232, 16777

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 16895 in the body.
You can then email your comments to 16895 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox

Report forwarded to bug-grep <at> gnu.org:
bug#16895; Package grep. (Thu, 27 Feb 2014 17:35:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Paul Eggert <eggert <at> cs.ucla.edu>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Thu, 27 Feb 2014 17:35:03 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: bug-grep <at> gnu.org
Subject: [PATCH] grep: fix multiple bugs with bracket expressions
Date: Thu, 27 Feb 2014 09:34:33 -0800

[Message part 1 (text/plain, inline)]

Tags: patch done

I'm afraid there are several problems in the dfa code.  I still don't 
have a handle on all of them, but here's my first patch to deal with the 
first major one I found.  Patterns like [a-[.z.]], which caused 'grep' 
to dump core until recently, still aren't being handled correctly, and 
there are several closely related bugs here.  I've taken the liberty of 
pushing the attached patch.

[0001-grep-fix-multiple-bugs-with-bracket-expressions.patch (text/x-patch, attachment)]

Added tag(s) fixed. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Thu, 27 Feb 2014 17:49:01 GMT) Full text and rfc822 format available.

Information forwarded to bug-grep <at> gnu.org:
bug#16895; Package grep. (Thu, 27 Feb 2014 20:32:01 GMT) Full text and rfc822 format available.

Message #10 received at 16895 <at> debbugs.gnu.org (full text, mbox):

From: Aharon Robbins <arnold <at> skeeve.com>
To: eggert <at> cs.ucla.edu, 16895 <at> debbugs.gnu.org
Subject: Re: bug#16895: [PATCH] grep: fix multiple bugs with bracket
 expressions
Date: Thu, 27 Feb 2014 22:31:14 +0200

Hi Paul.

> Subject: bug#16895: [PATCH] grep: fix multiple bugs with bracket expressions
> To: 16895 <at> debbugs.gnu.org
> Date: Thu, 27 Feb 2014 09:34:33 -0800
> From: Paul Eggert <eggert <at> cs.ucla.edu>
>
> I'm afraid there are several problems in the dfa code.  I still don't 
> have a handle on all of them, but here's my first patch to deal with the 
> first major one I found.  Patterns like [a-[.z.]], which caused 'grep' 
> to dump core until recently, still aren't being handled correctly, and 
> there are several closely related bugs here.  I've taken the liberty of 
> pushing the attached patch.

Thanks. This looks promising. A few comments / questions.

> +/* Return true if the current locale is known to be a unibyte locale
> +   without multicharacter collating sequences and where range
> +   comparisons simply use the native encoding.  These locales can be
> +   processed more efficiently.  */
> +
> +static bool
> +using_simple_locale (void)
> +{
> +  /* True if the native character set is known to be compatible with
> +     the C locale.  The following test isn't perfect, but it's good
> +     enough in practice, as only ASCII and EBCDIC are in common use
> +     and this test correctly accepts ASCII and rejects EBCDIC.  */
> +  enum { native_c_charset =
> +    ('\b' == 8 && '\t' == 9 && '\n' == 10 && '\v' == 11 && '\f' == 12
> +     && '\r' == 13 && ' ' == 32 && '!' == 33 && '"' == 34 && '#' == 35
> +     && '%' == 37 && '&' == 38 && '\'' == 39 && '(' == 40 && ')' == 41
> +     && '*' == 42 && '+' == 43 && ',' == 44 && '-' == 45 && '.' == 46
> +     && '/' == 47 && '0' == 48 && '9' == 57 && ':' == 58 && ';' == 59
> +     && '<' == 60 && '=' == 61 && '>' == 62 && '?' == 63 && 'A' == 65
> +     && 'Z' == 90 && '[' == 91 && '\\' == 92 && ']' == 93 && '^' == 94
> +     && '_' == 95 && 'a' == 97 && 'z' == 122 && '{' == 123 && '|' == 124
> +     && '}' == 125 && '~' == 126)
> +  };

What a mouthful!  Is all that really necessary?

> +          if ((c1 == ':' && syntax_bits & RE_CHAR_CLASSES)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I'd suggest parentheses around the bit with the bitwise operator,
both for readability and to match the rest of the code.

> @@ -1000,7 +1043,10 @@ parse_bracket_exp (void)
>                /* Fetch bracket.  */
>                FETCH_WC (c, wc, _("unbalanced ["));
>                if (c1 == ':')
> -                /* build character class.  */
> +                /* Build character class.  POSIX allows character
> +                   classes to match multicharacter collating elements,
> +                   but the regex code does not support that, so do not
> +                   worry about that possibility.  */

I thought GLIBC did support them?

I will try this out in gawk, sometime in the next few days and
let you know how it goes.

Thanks for the work!

Arnold

bug marked as fixed in version 16232 16777, send any further explanations to 16895 <at> debbugs.gnu.org and Paul Eggert <eggert <at> cs.ucla.edu> Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Thu, 27 Feb 2014 21:03:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-grep <at> gnu.org:
bug#16895; Package grep. (Thu, 27 Feb 2014 21:25:02 GMT) Full text and rfc822 format available.

Message #15 received at 16895 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Aharon Robbins <arnold <at> skeeve.com>, 16895 <at> debbugs.gnu.org
Subject: Re: bug#16895: [PATCH] grep: fix multiple bugs with bracket
 expressions
Date: Thu, 27 Feb 2014 13:24:53 -0800

[Message part 1 (text/plain, inline)]

On 02/27/2014 12:31 PM, Aharon Robbins wrote:
> What a mouthful!  Is all that really necessary?
>

You should have seen it before I trimmed it down; it listed every POSIX 
character.  I dunno, maybe it could be trimmed, but I was worried about 
oddball character sets like the unibyte JIS character set that's like 
ASCII but substitutes Yen-sign for '\', and a couple of other 
substitutions like that.  I figured better safe than sorry.  No big deal 
of course.

> I'd suggest parentheses around the bit with the bitwise operator, both 
> for readability and to match the rest of the code. 

Done, with the attached patch.  Oh, and I fixed an xdigit buglet I found 
too, in the second patch in the attachment.

>> >@@ -1000,7 +1043,10 @@ parse_bracket_exp (void)
>> >                /* Fetch bracket.  */
>> >                FETCH_WC (c, wc, _("unbalanced ["));
>> >                if (c1 == ':')
>> >-                /* build character class.  */
>> >+                /* Build character class.  POSIX allows character
>> >+                   classes to match multicharacter collating elements,
>> >+                   but the regex code does not support that, so do not
>> >+                   worry about that possibility.  */
> I thought GLIBC did support them?
Source code says no.  That is, [[:alpha:]] never matches a 
multicharacter collating sequence.  [[=a=]] might do so, but [[:alpha:]] 
doesn't.  (Unless I'm reading the source code wrong, which is possible.  
It's not documented either way, as far as I know.)

[grep.diff (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16895; Package grep. (Fri, 28 Feb 2014 12:39:01 GMT) Full text and rfc822 format available.

Message #18 received at 16895 <at> debbugs.gnu.org (full text, mbox):

From: Aharon Robbins <arnold <at> skeeve.com>
To: eggert <at> cs.ucla.edu, 16895 <at> debbugs.gnu.org
Subject: Re: bug#16895: [PATCH] grep: fix multiple bugs with bracket
 expressions
Date: Fri, 28 Feb 2014 14:37:59 +0200

Hi Paul.

> Date: Thu, 27 Feb 2014 13:24:53 -0800
> From: Paul Eggert <eggert <at> cs.ucla.edu>
> Organization: UCLA Computer Science Department
> To: Aharon Robbins <arnold <at> skeeve.com>, 16895 <at> debbugs.gnu.org
> Subject: Re: bug#16895: [PATCH] grep: fix multiple bugs with bracket expressions

OK - I tried out that patch (+ the two successors) in gawk and
it works fine, even causing a test that failed to now succeed (since
it falls back to regex).  I've merged and pushed the change.
I definitely owe you some beer for this one. :-)

> On 02/27/2014 12:31 PM, Aharon Robbins wrote:
> > What a mouthful!  Is all that really necessary?
>
> You should have seen it before I trimmed it down; it listed every POSIX 
> character.  I dunno, maybe it could be trimmed, but I was worried about 
> oddball character sets like the unibyte JIS character set that's like 
> ASCII but substitutes Yen-sign for '\', and a couple of other 
> substitutions like that.  I figured better safe than sorry.  No big deal 
> of course.

Is that done at compile time in those locales, or at run time? What
you've put in is a compile time check.  I ask out of total ignorance
and am wondering how it works.

> >> >-                /* build character class.  */
> >> >+                /* Build character class.  POSIX allows character
> >> >+                   classes to match multicharacter collating elements,
> >> >+                   but the regex code does not support that, so do not
> >> >+                   worry about that possibility.  */
> >
> > I thought GLIBC did support them?
>
> Source code says no.  That is, [[:alpha:]] never matches a 
> multicharacter collating sequence.  [[=a=]] might do so, but [[:alpha:]] 
> doesn't.  (Unless I'm reading the source code wrong, which is possible.  
> It's not documented either way, as far as I know.)

Ah. I misunderstood the context. GLIBC does support [[=a=]] and [[.ch.]],
though, right?

Thanks!

Arnold

Information forwarded to bug-grep <at> gnu.org:
bug#16895; Package grep. (Fri, 28 Feb 2014 21:22:01 GMT) Full text and rfc822 format available.

Message #21 received at 16895 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Aharon Robbins <arnold <at> skeeve.com>, 16895 <at> debbugs.gnu.org
Subject: Re: bug#16895: [PATCH] grep: fix multiple bugs with bracket
 expressions
Date: Fri, 28 Feb 2014 13:21:33 -0800

On 02/28/2014 04:37 AM, Aharon Robbins wrote:
> Is that done at compile time in those locales, or at run time?

A bit of both.  At compile-time we check that the compile-time character 
set is compatible with ASCII (i.e., the C aka POSIX locale).  At 
run-time we check that the locale is indeed C aka POSIX.  Both checks 
need to succeed..

> GLIBC does support [[=a=]] and [[.ch.]], though, right?
>

Yes.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 29 Mar 2014 11:24:09 GMT) Full text and rfc822 format available.

This bug report was last modified 11 years and 83 days ago.

Previous Next

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #16895 [PATCH] grep: fix multiple bugs with bracket expressions

GNU bug report logs - #16895
[PATCH] grep: fix multiple bugs with bracket expressions