GNU bug report logs -
#16895
[PATCH] grep: fix multiple bugs with bracket expressions
Previous Next
Reported by: Paul Eggert <eggert <at> cs.ucla.edu>
Date: Thu, 27 Feb 2014 17:35:01 UTC
Severity: normal
Tags: fixed, patch
Fixed in versions 16232, 16777
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 16895 in the body.
You can then email your comments to 16895 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-grep <at> gnu.org
:
bug#16895
; Package
grep
.
(Thu, 27 Feb 2014 17:35:02 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
Paul Eggert <eggert <at> cs.ucla.edu>
:
New bug report received and forwarded. Copy sent to
bug-grep <at> gnu.org
.
(Thu, 27 Feb 2014 17:35:03 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Tags: patch done
I'm afraid there are several problems in the dfa code. I still don't
have a handle on all of them, but here's my first patch to deal with the
first major one I found. Patterns like [a-[.z.]], which caused 'grep'
to dump core until recently, still aren't being handled correctly, and
there are several closely related bugs here. I've taken the liberty of
pushing the attached patch.
[0001-grep-fix-multiple-bugs-with-bracket-expressions.patch (text/x-patch, attachment)]
Added tag(s) fixed.
Request was from
Paul Eggert <eggert <at> cs.ucla.edu>
to
control <at> debbugs.gnu.org
.
(Thu, 27 Feb 2014 17:49:01 GMT)
Full text and
rfc822 format available.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16895
; Package
grep
.
(Thu, 27 Feb 2014 20:32:01 GMT)
Full text and
rfc822 format available.
Message #10 received at 16895 <at> debbugs.gnu.org (full text, mbox):
Hi Paul.
> Subject: bug#16895: [PATCH] grep: fix multiple bugs with bracket expressions
> To: 16895 <at> debbugs.gnu.org
> Date: Thu, 27 Feb 2014 09:34:33 -0800
> From: Paul Eggert <eggert <at> cs.ucla.edu>
>
> I'm afraid there are several problems in the dfa code. I still don't
> have a handle on all of them, but here's my first patch to deal with the
> first major one I found. Patterns like [a-[.z.]], which caused 'grep'
> to dump core until recently, still aren't being handled correctly, and
> there are several closely related bugs here. I've taken the liberty of
> pushing the attached patch.
Thanks. This looks promising. A few comments / questions.
> +/* Return true if the current locale is known to be a unibyte locale
> + without multicharacter collating sequences and where range
> + comparisons simply use the native encoding. These locales can be
> + processed more efficiently. */
> +
> +static bool
> +using_simple_locale (void)
> +{
> + /* True if the native character set is known to be compatible with
> + the C locale. The following test isn't perfect, but it's good
> + enough in practice, as only ASCII and EBCDIC are in common use
> + and this test correctly accepts ASCII and rejects EBCDIC. */
> + enum { native_c_charset =
> + ('\b' == 8 && '\t' == 9 && '\n' == 10 && '\v' == 11 && '\f' == 12
> + && '\r' == 13 && ' ' == 32 && '!' == 33 && '"' == 34 && '#' == 35
> + && '%' == 37 && '&' == 38 && '\'' == 39 && '(' == 40 && ')' == 41
> + && '*' == 42 && '+' == 43 && ',' == 44 && '-' == 45 && '.' == 46
> + && '/' == 47 && '0' == 48 && '9' == 57 && ':' == 58 && ';' == 59
> + && '<' == 60 && '=' == 61 && '>' == 62 && '?' == 63 && 'A' == 65
> + && 'Z' == 90 && '[' == 91 && '\\' == 92 && ']' == 93 && '^' == 94
> + && '_' == 95 && 'a' == 97 && 'z' == 122 && '{' == 123 && '|' == 124
> + && '}' == 125 && '~' == 126)
> + };
What a mouthful! Is all that really necessary?
> + if ((c1 == ':' && syntax_bits & RE_CHAR_CLASSES)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I'd suggest parentheses around the bit with the bitwise operator,
both for readability and to match the rest of the code.
> @@ -1000,7 +1043,10 @@ parse_bracket_exp (void)
> /* Fetch bracket. */
> FETCH_WC (c, wc, _("unbalanced ["));
> if (c1 == ':')
> - /* build character class. */
> + /* Build character class. POSIX allows character
> + classes to match multicharacter collating elements,
> + but the regex code does not support that, so do not
> + worry about that possibility. */
I thought GLIBC did support them?
I will try this out in gawk, sometime in the next few days and
let you know how it goes.
Thanks for the work!
Arnold
bug marked as fixed in version 16232 16777, send any further explanations to
16895 <at> debbugs.gnu.org and Paul Eggert <eggert <at> cs.ucla.edu>
Request was from
Paul Eggert <eggert <at> cs.ucla.edu>
to
control <at> debbugs.gnu.org
.
(Thu, 27 Feb 2014 21:03:02 GMT)
Full text and
rfc822 format available.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16895
; Package
grep
.
(Thu, 27 Feb 2014 21:25:02 GMT)
Full text and
rfc822 format available.
Message #15 received at 16895 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On 02/27/2014 12:31 PM, Aharon Robbins wrote:
> What a mouthful! Is all that really necessary?
>
You should have seen it before I trimmed it down; it listed every POSIX
character. I dunno, maybe it could be trimmed, but I was worried about
oddball character sets like the unibyte JIS character set that's like
ASCII but substitutes Yen-sign for '\', and a couple of other
substitutions like that. I figured better safe than sorry. No big deal
of course.
> I'd suggest parentheses around the bit with the bitwise operator, both
> for readability and to match the rest of the code.
Done, with the attached patch. Oh, and I fixed an xdigit buglet I found
too, in the second patch in the attachment.
>> >@@ -1000,7 +1043,10 @@ parse_bracket_exp (void)
>> > /* Fetch bracket. */
>> > FETCH_WC (c, wc, _("unbalanced ["));
>> > if (c1 == ':')
>> >- /* build character class. */
>> >+ /* Build character class. POSIX allows character
>> >+ classes to match multicharacter collating elements,
>> >+ but the regex code does not support that, so do not
>> >+ worry about that possibility. */
> I thought GLIBC did support them?
Source code says no. That is, [[:alpha:]] never matches a
multicharacter collating sequence. [[=a=]] might do so, but [[:alpha:]]
doesn't. (Unless I'm reading the source code wrong, which is possible.
It's not documented either way, as far as I know.)
[grep.diff (text/x-patch, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16895
; Package
grep
.
(Fri, 28 Feb 2014 12:39:01 GMT)
Full text and
rfc822 format available.
Message #18 received at 16895 <at> debbugs.gnu.org (full text, mbox):
Hi Paul.
> Date: Thu, 27 Feb 2014 13:24:53 -0800
> From: Paul Eggert <eggert <at> cs.ucla.edu>
> Organization: UCLA Computer Science Department
> To: Aharon Robbins <arnold <at> skeeve.com>, 16895 <at> debbugs.gnu.org
> Subject: Re: bug#16895: [PATCH] grep: fix multiple bugs with bracket expressions
OK - I tried out that patch (+ the two successors) in gawk and
it works fine, even causing a test that failed to now succeed (since
it falls back to regex). I've merged and pushed the change.
I definitely owe you some beer for this one. :-)
> On 02/27/2014 12:31 PM, Aharon Robbins wrote:
> > What a mouthful! Is all that really necessary?
>
> You should have seen it before I trimmed it down; it listed every POSIX
> character. I dunno, maybe it could be trimmed, but I was worried about
> oddball character sets like the unibyte JIS character set that's like
> ASCII but substitutes Yen-sign for '\', and a couple of other
> substitutions like that. I figured better safe than sorry. No big deal
> of course.
Is that done at compile time in those locales, or at run time? What
you've put in is a compile time check. I ask out of total ignorance
and am wondering how it works.
> >> >- /* build character class. */
> >> >+ /* Build character class. POSIX allows character
> >> >+ classes to match multicharacter collating elements,
> >> >+ but the regex code does not support that, so do not
> >> >+ worry about that possibility. */
> >
> > I thought GLIBC did support them?
>
> Source code says no. That is, [[:alpha:]] never matches a
> multicharacter collating sequence. [[=a=]] might do so, but [[:alpha:]]
> doesn't. (Unless I'm reading the source code wrong, which is possible.
> It's not documented either way, as far as I know.)
Ah. I misunderstood the context. GLIBC does support [[=a=]] and [[.ch.]],
though, right?
Thanks!
Arnold
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16895
; Package
grep
.
(Fri, 28 Feb 2014 21:22:01 GMT)
Full text and
rfc822 format available.
Message #21 received at 16895 <at> debbugs.gnu.org (full text, mbox):
On 02/28/2014 04:37 AM, Aharon Robbins wrote:
> Is that done at compile time in those locales, or at run time?
A bit of both. At compile-time we check that the compile-time character
set is compatible with ASCII (i.e., the C aka POSIX locale). At
run-time we check that the locale is indeed C aka POSIX. Both checks
need to succeed..
> GLIBC does support [[=a=]] and [[.ch.]], though, right?
>
Yes.
bug archived.
Request was from
Debbugs Internal Request <help-debbugs <at> gnu.org>
to
internal_control <at> debbugs.gnu.org
.
(Sat, 29 Mar 2014 11:24:09 GMT)
Full text and
rfc822 format available.
This bug report was last modified 11 years and 83 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.