From unknown Wed Jun 18 23:13:41 2025 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-Mailer: MIME-tools 5.509 (Entity 5.509) Content-Type: text/plain; charset=utf-8 From: bug#16895 <16895@debbugs.gnu.org> To: bug#16895 <16895@debbugs.gnu.org> Subject: Status: [PATCH] grep: fix multiple bugs with bracket expressions Reply-To: bug#16895 <16895@debbugs.gnu.org> Date: Thu, 19 Jun 2025 06:13:41 +0000 retitle 16895 [PATCH] grep: fix multiple bugs with bracket expressions reassign 16895 grep submitter 16895 Paul Eggert severity 16895 normal tag 16895 fixed patch thanks From debbugs-submit-bounces@debbugs.gnu.org Thu Feb 27 12:35:00 2014 Received: (at submit) by debbugs.gnu.org; 27 Feb 2014 17:35:00 +0000 Received: from localhost ([127.0.0.1]:42907 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1WJ4rL-0003gD-Dj for submit@debbugs.gnu.org; Thu, 27 Feb 2014 12:35:00 -0500 Received: from eggs.gnu.org ([208.118.235.92]:51937) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1WJ4rH-0003g0-Q8 for submit@debbugs.gnu.org; Thu, 27 Feb 2014 12:34:57 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1WJ4rC-0002ss-I8 for submit@debbugs.gnu.org; Thu, 27 Feb 2014 12:34:55 -0500 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50 autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:59473) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1WJ4rC-0002so-EE for submit@debbugs.gnu.org; Thu, 27 Feb 2014 12:34:50 -0500 Received: from eggs.gnu.org ([2001:4830:134:3::10]:53627) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1WJ4r7-0005jO-Ji for bug-grep@gnu.org; Thu, 27 Feb 2014 12:34:50 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1WJ4r2-0002rG-SV for 
bug-grep@gnu.org; Thu, 27 Feb 2014 12:34:45 -0500 Received: from smtp.cs.ucla.edu ([131.179.128.62]:38855) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1WJ4r2-0002r0-F3 for bug-grep@gnu.org; Thu, 27 Feb 2014 12:34:40 -0500 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 25C6FA60006 for ; Thu, 27 Feb 2014 09:34:39 -0800 (PST) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 63rZ0Ysuzjjj for ; Thu, 27 Feb 2014 09:34:37 -0800 (PST) Received: from penguin.cs.ucla.edu (Penguin.CS.UCLA.EDU [131.179.64.200]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 6A635A60001 for ; Thu, 27 Feb 2014 09:34:37 -0800 (PST) Message-ID: <530F7729.4080705@cs.ucla.edu> Date: Thu, 27 Feb 2014 09:34:33 -0800 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.3.0 MIME-Version: 1.0 To: bug-grep@gnu.org Subject: [PATCH] grep: fix multiple bugs with bracket expressions Content-Type: multipart/mixed; boundary="------------020109000801080505030300" X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value). X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -4.0 (----) X-Debbugs-Envelope-To: submit X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -4.0 (----) This is a multi-part message in MIME format. --------------020109000801080505030300 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Tags: patch done I'm afraid there are several problems in the dfa code. 
I still don't have a handle on all of them, but here's my first patch to deal with the first major one I found. Patterns like [a-[.z.]], which caused 'grep' to dump core until recently, still aren't being handled correctly, and there are several closely related bugs here. I've taken the liberty of pushing the attached patch. --------------020109000801080505030300 Content-Type: text/x-patch; name="0001-grep-fix-multiple-bugs-with-bracket-expressions.patch" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename*0="0001-grep-fix-multiple-bugs-with-bracket-expressions.patch" >From f11f0c9351fdd2bd65efdb469754096d1a237d61 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Thu, 27 Feb 2014 09:26:23 -0800 Subject: [PATCH] grep: fix multiple bugs with bracket expressions * NEWS: Document this. * src/dfa.c (using_simple_locale): New function. (parse_bracket_exp): Handle bracket expressions like [a-[.z.]] correctly. Don't assume that dfaexec handles expressions like [^a-z] correctly, as they can match multiple characters in some locales. * tests/posix-bracket: New file. * tests/Makefile.am (TESTS): Add it. --- NEWS | 4 ++ src/dfa.c | 129 +++++++++++++++++++++++++++++----------------------- tests/Makefile.am | 1 + tests/posix-bracket | 33 ++++++++++++++ 4 files changed, 110 insertions(+), 57 deletions(-) create mode 100755 tests/posix-bracket diff --git a/NEWS b/NEWS index 657f3d1..6cfcaba 100644 --- a/NEWS +++ b/NEWS @@ -4,6 +4,10 @@ GNU grep NEWS -*- outline -*- ** Bug fixes + grep no longer mishandles patterns like [a-[.z.]], and no longer + mishandles patterns like [^a] in locales that have multicharacter + collating sequences so that [^a] can match a string of two characters. + grep -P now works with -w and -x and backreferences. Before, echo aa|grep -Pw '(.)\1' would fail to match, yet echo aa|grep -Pw '(.)\2' would match. 
diff --git a/src/dfa.c b/src/dfa.c index 8906ed3..65ab5d6 100644 --- a/src/dfa.c +++ b/src/dfa.c @@ -182,7 +182,8 @@ enum EMPTY = NOTCHAR, /* EMPTY is a terminal symbol that matches the empty string. */ - BACKREF, /* BACKREF is generated by \; it + BACKREF, /* BACKREF is generated by \ + or by any other construct that is not completely handled. If the scanner detects a transition on backref, it returns a kind of "semi-success" indicating that @@ -769,6 +770,45 @@ using_utf8 (void) return utf8; } +/* Return true if the current locale is known to be a unibyte locale + without multicharacter collating sequences and where range + comparisons simply use the native encoding. These locales can be + processed more efficiently. */ + +static bool +using_simple_locale (void) +{ + /* True if the native character set is known to be compatible with + the C locale. The following test isn't perfect, but it's good + enough in practice, as only ASCII and EBCDIC are in common use + and this test correctly accepts ASCII and rejects EBCDIC. */ + enum { native_c_charset = + ('\b' == 8 && '\t' == 9 && '\n' == 10 && '\v' == 11 && '\f' == 12 + && '\r' == 13 && ' ' == 32 && '!' == 33 && '"' == 34 && '#' == 35 + && '%' == 37 && '&' == 38 && '\'' == 39 && '(' == 40 && ')' == 41 + && '*' == 42 && '+' == 43 && ',' == 44 && '-' == 45 && '.' == 46 + && '/' == 47 && '0' == 48 && '9' == 57 && ':' == 58 && ';' == 59 + && '<' == 60 && '=' == 61 && '>' == 62 && '?' == 63 && 'A' == 65 + && 'Z' == 90 && '[' == 91 && '\\' == 92 && ']' == 93 && '^' == 94 + && '_' == 95 && 'a' == 97 && 'z' == 122 && '{' == 123 && '|' == 124 + && '}' == 125 && '~' == 126) + }; + + if (! native_c_charset || MB_CUR_MAX > 1) + return false; + else + { + static int unibyte_c = -1; + if (unibyte_c < 0) + { + char *locale = setlocale (LC_ALL, 0); + unibyte_c = (locale && (STREQ (locale, "C") + || STREQ (locale, "POSIX"))); + } + return unibyte_c; + } +} + /* Lexical analyzer. 
All the dross that deals with the obnoxious GNU Regex syntax bits is located here. The poor, suffering reader is referred to the GNU Regex documentation for the @@ -917,6 +957,10 @@ parse_bracket_exp (void) int c, c1, c2; charclass ccl; + /* True if this is a bracket expression that dfaexec is known to + process correctly. */ + bool known_bracket_exp = true; + /* Used to warn about [:space:]. Bit 0 = first character is a colon. Bit 1 = last character is a colon. @@ -958,6 +1002,7 @@ parse_bracket_exp (void) { FETCH_WC (c, wc, _("unbalanced [")); invert = 1; + known_bracket_exp = using_simple_locale (); } else invert = 0; @@ -972,16 +1017,14 @@ parse_bracket_exp (void) we just treat it as a bunch of ordinary characters. We can do this because we assume regex has checked for syntax errors before dfa is ever called. */ - if (c == '[' && (syntax_bits & RE_CHAR_CLASSES)) + if (c == '[') { #define MAX_BRACKET_STRING_LEN 32 char str[MAX_BRACKET_STRING_LEN + 1]; FETCH_WC (c1, wc1, _("unbalanced [")); - /* If pattern contains '[[:', '[[.', or '[[='. */ - if (c1 == ':' - /* TODO: handle '[[.' and '[[=' also for MB_CUR_MAX == 1. */ - || (MB_CUR_MAX > 1 && (c1 == '.' || c1 == '='))) + if ((c1 == ':' && syntax_bits & RE_CHAR_CLASSES) + || c1 == '.' || c1 == '=') { size_t len = 0; for (;;) @@ -1000,7 +1043,10 @@ parse_bracket_exp (void) /* Fetch bracket. */ FETCH_WC (c, wc, _("unbalanced [")); if (c1 == ':') - /* build character class. */ + /* Build character class. POSIX allows character + classes to match multicharacter collating elements, + but the regex code does not support that, so do not + worry about that possibility. */ { char const *class = (case_fold && (STREQ (str, "upper") @@ -1024,28 +1070,9 @@ parse_bracket_exp (void) if (pred->func (c2)) setbit_case_fold_c (c2, ccl); } + else + known_bracket_exp = false; - else if (MBS_SUPPORT && (c1 == '=' || c1 == '.')) - { - char *elem = xmemdup (str, len + 1); - - if (c1 == '=') - /* build equivalence class. 
*/ - { - REALLOC_IF_NECESSARY (work_mbc->equivs, - equivs_al, work_mbc->nequivs + 1); - work_mbc->equivs[work_mbc->nequivs++] = elem; - } - - if (c1 == '.') - /* build collating element. */ - { - REALLOC_IF_NECESSARY (work_mbc->coll_elems, - coll_elems_al, - work_mbc->ncoll_elems + 1); - work_mbc->coll_elems[work_mbc->ncoll_elems++] = elem; - } - } colon_warning_state |= 8; /* Fetch new lookahead character. */ @@ -1067,6 +1094,16 @@ parse_bracket_exp (void) /* build range characters. */ { FETCH_WC (c2, wc2, _("unbalanced [")); + + /* A bracket expression like [a-[.aa.]] matches an unknown set. + Treat it like [-a[.aa.]] while parsing it, and + remember that the set is unknown. */ + if (c2 == '[' && *lexptr == '.') + { + known_bracket_exp = false; + c2 = ']'; + } + if (c2 == ']') { /* In the case [x-], the - is an ordinary hyphen, @@ -1104,36 +1141,11 @@ parse_bracket_exp (void) work_mbc->range_ends[work_mbc->nranges++] = towupper (wc2); } } + else if (using_simple_locale ()) + for (; c <= c2; c++) + setbit_case_fold_c (c, ccl); else - { - /* Defer to the system regex library about the meaning - of range expressions. 
*/ - struct re_pattern_buffer re = { 0 }; - char const *compile_msg; -#if 199901 <= __STDC_VERSION__ - char pattern[] = { '[', '\\', c, '-', '\\', c2, ']' }; -#else - char pattern[] = { '[', '\\', 0, '-', '\\', 0, ']' }; - pattern[2] = c; - pattern[5] = c2; -#endif - re_set_syntax (syntax_bits | RE_BACKSLASH_ESCAPE_IN_LISTS); - compile_msg = re_compile_pattern (pattern, sizeof pattern, &re); - if (compile_msg) - dfaerror (compile_msg); - for (c = 0; c < NOTCHAR; c++) - { - char subject = c; - switch (re_match (&re, &subject, 1, 0, NULL)) - { - case 1: setbit (c, ccl); break; - case -1: break; - default: xalloc_die (); - } - } - regfree (&re); - re_set_syntax (syntax_bits); - } + known_bracket_exp = false; colon_warning_state |= 8; FETCH_WC (c1, wc1, _("unbalanced [")); @@ -1171,6 +1183,9 @@ parse_bracket_exp (void) if (colon_warning_state == 7) dfawarn (_("character class syntax is [[:space:]], not [:space:]")); + if (! known_bracket_exp) + return BACKREF; + if (MB_CUR_MAX > 1) { static charclass zeroclass; diff --git a/tests/Makefile.am b/tests/Makefile.am index 742a580..972ffc5 100644 --- a/tests/Makefile.am +++ b/tests/Makefile.am @@ -86,6 +86,7 @@ TESTS = \ pcre-w \ pcre-wx-backref \ pcre-z \ + posix-bracket \ prefix-of-multibyte \ r-dot \ repetition-overflow \ diff --git a/tests/posix-bracket b/tests/posix-bracket new file mode 100755 index 0000000..d9d1d84 --- /dev/null +++ b/tests/posix-bracket @@ -0,0 +1,33 @@ +#!/bin/sh +# Check various bracket expressions in the POSIX locale. + +# Copyright 2014 Free Software Foundation, Inc. + +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. 
+ +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. + +# You should have received a copy of the GNU General Public License +# along with this program. If not, see . + +. "${srcdir=.}/init.sh"; path_prepend_ ../src +LC_ALL=C +export LC_ALL + +fail=0 + +echo a >in || framework_failure_ +for bracketed in '[.a.]' '[.a.]-a' 'a-[.a.]' '[.a.]-[.a.]' \ + '[=a=]' '[:alpha:]'; do + grep "[$bracketed]" in >out || fail=1 + compare in out || fail=1 + grep "[^$bracketed]" in >out && fail=1 + compare /dev/null out || fail=1 +done +Exit $fail -- 1.8.5.3 --------------020109000801080505030300-- From debbugs-submit-bounces@debbugs.gnu.org Thu Feb 27 12:48:01 2014 Received: (at control) by debbugs.gnu.org; 27 Feb 2014 17:48:01 +0000 Received: from localhost ([127.0.0.1]:42953 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1WJ53x-00047r-Kb for submit@debbugs.gnu.org; Thu, 27 Feb 2014 12:48:01 -0500 Received: from smtp.cs.ucla.edu ([131.179.128.62]:49154) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1WJ53w-00047e-Gw for control@debbugs.gnu.org; Thu, 27 Feb 2014 12:48:00 -0500 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 07A80A60006 for ; Thu, 27 Feb 2014 09:48:00 -0800 (PST) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id AgeO2lVPnMi0 for ; Thu, 27 Feb 2014 09:47:59 -0800 (PST) Received: from penguin.cs.ucla.edu (Penguin.CS.UCLA.EDU [131.179.64.200]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id BF70EA60002 for ; Thu, 27 Feb 2014 09:47:59 -0800 (PST) Message-ID: <530F7A4F.3080801@cs.ucla.edu> Date: Thu, 27 Feb 2014 09:47:59 -0800 From: Paul Eggert 
Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.3.0 MIME-Version: 1.0 To: control@debbugs.gnu.org Subject: Re: 16895 is done References: <530F79FD.9030006@cs.ucla.edu> In-Reply-To: <530F79FD.9030006@cs.ucla.edu> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: control X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.3 (--) tags 16895 fixed From debbugs-submit-bounces@debbugs.gnu.org Thu Feb 27 15:31:22 2014 Received: (at 16895) by debbugs.gnu.org; 27 Feb 2014 20:31:22 +0000 Received: from localhost ([127.0.0.1]:43076 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1WJ7c2-0002s9-68 for submit@debbugs.gnu.org; Thu, 27 Feb 2014 15:31:22 -0500 Received: from mxout5.netvision.net.il ([194.90.6.65]:59896) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1WJ7by-0002rv-Q5 for 16895@debbugs.gnu.org; Thu, 27 Feb 2014 15:31:20 -0500 MIME-version: 1.0 Content-transfer-encoding: 7BIT Content-type: text/plain; CHARSET=US-ASCII Received: from skeeve.com ([93.172.103.69]) by mxout5.netvision.net.il (Oracle Communications Messaging Server 7u4-24.01(7.0.4.24.0) 64bit (built Nov 17 2011)) with ESMTPS id <0N1O00JT7904M821@mxout5.netvision.net.il> for 16895@debbugs.gnu.org; Thu, 27 Feb 2014 22:31:17 +0200 (IST) Received: from skeeve.com (skeeve.com [127.0.0.1]) by skeeve.com (8.14.4/8.14.4/Debian-2ubuntu2.1) with ESMTP id s1RKVFKp003504; Thu, 27 Feb 2014 22:31:15 +0200 Received: (from arnold@localhost) by skeeve.com (8.14.4/8.14.4/Submit) id s1RKVEM3003503; Thu, 27 Feb 2014 22:31:14 +0200 From: Aharon Robbins Message-id: <201402272031.s1RKVEM3003503@skeeve.com> 
Date: Thu, 27 Feb 2014 22:31:14 +0200
To: eggert@cs.ucla.edu, 16895@debbugs.gnu.org
Subject: Re: bug#16895: [PATCH] grep: fix multiple bugs with bracket expressions
References: <530F7729.4080705@cs.ucla.edu>
In-reply-to: <530F7729.4080705@cs.ucla.edu>
User-Agent: Heirloom mailx 12.5 6/20/10
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 16895
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit"
X-Spam-Score: 0.0 (/)

Hi Paul.

> Subject: bug#16895: [PATCH] grep: fix multiple bugs with bracket expressions
> To: 16895@debbugs.gnu.org
> Date: Thu, 27 Feb 2014 09:34:33 -0800
> From: Paul Eggert
>
> I'm afraid there are several problems in the dfa code. I still don't
> have a handle on all of them, but here's my first patch to deal with the
> first major one I found. Patterns like [a-[.z.]], which caused 'grep'
> to dump core until recently, still aren't being handled correctly, and
> there are several closely related bugs here. I've taken the liberty of
> pushing the attached patch.

Thanks. This looks promising. A few comments / questions.

> +/* Return true if the current locale is known to be a unibyte locale
> +   without multicharacter collating sequences and where range
> +   comparisons simply use the native encoding. These locales can be
> +   processed more efficiently. */
> +
> +static bool
> +using_simple_locale (void)
> +{
> +  /* True if the native character set is known to be compatible with
> +     the C locale. The following test isn't perfect, but it's good
> +     enough in practice, as only ASCII and EBCDIC are in common use
> +     and this test correctly accepts ASCII and rejects EBCDIC. */
> +  enum { native_c_charset =
> +    ('\b' == 8 && '\t' == 9 && '\n' == 10 && '\v' == 11 && '\f' == 12
> +     && '\r' == 13 && ' ' == 32 && '!' == 33 && '"' == 34 && '#' == 35
> +     && '%' == 37 && '&' == 38 && '\'' == 39 && '(' == 40 && ')' == 41
> +     && '*' == 42 && '+' == 43 && ',' == 44 && '-' == 45 && '.' == 46
> +     && '/' == 47 && '0' == 48 && '9' == 57 && ':' == 58 && ';' == 59
> +     && '<' == 60 && '=' == 61 && '>' == 62 && '?' == 63 && 'A' == 65
> +     && 'Z' == 90 && '[' == 91 && '\\' == 92 && ']' == 93 && '^' == 94
> +     && '_' == 95 && 'a' == 97 && 'z' == 122 && '{' == 123 && '|' == 124
> +     && '}' == 125 && '~' == 126)
> +  };

What a mouthful! Is all that really necessary?

> +          if ((c1 == ':' && syntax_bits & RE_CHAR_CLASSES)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I'd suggest parentheses around the bit with the bitwise operator, both
for readability and to match the rest of the code.

> @@ -1000,7 +1043,10 @@ parse_bracket_exp (void)
>                    /* Fetch bracket. */
>                    FETCH_WC (c, wc, _("unbalanced ["));
>                    if (c1 == ':')
> -                    /* build character class. */
> +                    /* Build character class. POSIX allows character
> +                       classes to match multicharacter collating elements,
> +                       but the regex code does not support that, so do not
> +                       worry about that possibility. */

I thought GLIBC did support them?

I will try this out in gawk, sometime in the next few days and let
you know how it goes.

Thanks for the work!

Arnold

From debbugs-submit-bounces@debbugs.gnu.org Thu Feb 27 16:02:08 2014
Received: (at control) by debbugs.gnu.org; 27 Feb 2014 21:02:08 +0000
Received: from localhost ([127.0.0.1]:43120 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1WJ85o-0004yp-9W for submit@debbugs.gnu.org; Thu, 27 Feb 2014 16:02:08 -0500
Received: from smtp.cs.ucla.edu ([131.179.128.62]:32987) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1WJ85m-0004yh-Ji for control@debbugs.gnu.org; Thu, 27 Feb 2014 16:02:07 -0500
Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 2B2ED39E8013 for ; Thu, 27 Feb 2014 13:02:06 -0800 (PST)
X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu
Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id Tmak10ZPqw98 for ; Thu, 27 Feb 2014 13:02:05 -0800 (PST)
Received: from penguin.cs.ucla.edu (Penguin.CS.UCLA.EDU [131.179.64.200]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id DF29039E8008 for ; Thu, 27 Feb 2014 13:02:05 -0800 (PST)
Message-ID: <530FA7CD.1040307@cs.ucla.edu>
Date: Thu, 27 Feb 2014 13:02:05 -0800
From: Paul Eggert
Organization: UCLA Computer Science Department
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.3.0
MIME-Version: 1.0
To: control@debbugs.gnu.org
Subject: grep bugs fixed
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: control
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit"
X-Spam-Score: -2.3 (--)

close 16895 16232 16777
thanks

These bugs all seem to have been fixed by recent changes on master.

From debbugs-submit-bounces@debbugs.gnu.org Thu Feb 27 16:24:58 2014 Received: (at 16895) by debbugs.gnu.org; 27 Feb 2014 21:24:58 +0000 Received: from localhost ([127.0.0.1]:43143 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1WJ8Ru-0006qo-7j for submit@debbugs.gnu.org; Thu, 27 Feb 2014 16:24:58 -0500 Received: from smtp.cs.ucla.edu ([131.179.128.62]:34382) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1WJ8Rr-0006qc-1o for 16895@debbugs.gnu.org; Thu, 27 Feb 2014 16:24:55 -0500 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 505F039E8019; Thu, 27 Feb 2014 13:24:54 -0800 (PST) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id Ee2I-RJfHXuJ; Thu, 27 Feb 2014 13:24:53 -0800 (PST) Received: from penguin.cs.ucla.edu (Penguin.CS.UCLA.EDU [131.179.64.200]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id A84B839E8015; Thu, 27 Feb 2014 13:24:53 -0800 (PST) Message-ID: <530FAD25.50409@cs.ucla.edu> Date: Thu, 27 Feb 2014 13:24:53 -0800 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.3.0 MIME-Version: 1.0 To: Aharon Robbins , 16895@debbugs.gnu.org Subject: Re: bug#16895: [PATCH] grep: fix multiple bugs with bracket expressions References: <530F7729.4080705@cs.ucla.edu> <201402272031.s1RKVEM3003503@skeeve.com> In-Reply-To: <201402272031.s1RKVEM3003503@skeeve.com> Content-Type: multipart/mixed; boundary="------------070602090407080505000700" X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 16895 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 
-2.3 (--)

This is a multi-part message in MIME format.
--------------070602090407080505000700
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

On 02/27/2014 12:31 PM, Aharon Robbins wrote:
> What a mouthful! Is all that really necessary?
>
You should have seen it before I trimmed it down; it listed every POSIX
character. I dunno, maybe it could be trimmed, but I was worried about
oddball character sets like the unibyte JIS character set that's like
ASCII but substitutes Yen-sign for '\', and a couple of other
substitutions like that. I figured better safe than sorry. No big deal
of course.

> I'd suggest parentheses around the bit with the bitwise operator, both
> for readability and to match the rest of the code.

Done, with the attached patch. Oh, and I fixed an xdigit buglet I found
too, in the second patch in the attachment.

>> >@@ -1000,7 +1043,10 @@ parse_bracket_exp (void)
>> >                    /* Fetch bracket. */
>> >                    FETCH_WC (c, wc, _("unbalanced ["));
>> >                    if (c1 == ':')
>> >-                    /* build character class. */
>> >+                    /* Build character class. POSIX allows character
>> >+                       classes to match multicharacter collating elements,
>> >+                       but the regex code does not support that, so do not
>> >+                       worry about that possibility. */
> I thought GLIBC did support them?

Source code says no. That is, [[:alpha:]] never matches a
multicharacter collating sequence. [[=a=]] might do so, but [[:alpha:]]
doesn't. (Unless I'm reading the source code wrong, which is possible.
It's not documented either way, as far as I know.)

--------------070602090407080505000700
Content-Type: text/x-patch; name="grep.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="grep.diff"

>From 7725d64fb955e9491a0f1e9a95a655f67e0ab74e Mon Sep 17 00:00:00 2001
From: Paul Eggert
Date: Thu, 27 Feb 2014 13:17:45 -0800
Subject: [PATCH 1/2] * src/dfa.c (parse_bracket_exp): Parenthesize.

---
 src/dfa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/dfa.c b/src/dfa.c
index 65ab5d6..a49b834 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -1023,7 +1023,7 @@ parse_bracket_exp (void)
           char str[MAX_BRACKET_STRING_LEN + 1];
           FETCH_WC (c1, wc1, _("unbalanced ["));
 
-          if ((c1 == ':' && syntax_bits & RE_CHAR_CLASSES)
+          if ((c1 == ':' && (syntax_bits & RE_CHAR_CLASSES))
               || c1 == '.' || c1 == '=')
             {
               size_t len = 0;
-- 
1.8.5.3

>From 73dc80d42091a2c3d49dd2d9684e65b1107334a2 Mon Sep 17 00:00:00 2001
From: Paul Eggert
Date: Thu, 27 Feb 2014 13:19:33 -0800
Subject: [PATCH 2/2] * src/dfa.c (prednames): POSIX allows [[:xdigit:]] to
 match multibyte chars.

---
 src/dfa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/dfa.c b/src/dfa.c
index a49b834..f4590da 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -926,7 +926,7 @@ static const struct dfa_ctype prednames[] = {
   {"upper", isupper, false},
   {"lower", islower, false},
   {"digit", isdigit, true},
-  {"xdigit", isxdigit, true},
+  {"xdigit", isxdigit, false},
   {"space", isspace, false},
   {"punct", ispunct, false},
   {"alnum", isalnum, false},
-- 
1.8.5.3

--------------070602090407080505000700--

From debbugs-submit-bounces@debbugs.gnu.org Fri Feb 28 07:38:13 2014
Received: (at 16895) by debbugs.gnu.org; 28 Feb 2014 12:38:13 +0000
Received: from localhost ([127.0.0.1]:43747 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1WJMhg-0006GP-0i for submit@debbugs.gnu.org; Fri, 28 Feb 2014 07:38:12 -0500
Received: from mxout5.netvision.net.il ([194.90.6.65]:54882) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1WJMhd-0006GE-QJ for 16895@debbugs.gnu.org; Fri, 28 Feb 2014 07:38:10 -0500
MIME-version: 1.0
Content-transfer-encoding: 7BIT
Content-type: text/plain; CHARSET=US-ASCII
Received: from skeeve.com ([93.172.103.69]) by mxout5.netvision.net.il (Oracle Communications Messaging Server 7u4-24.01(7.0.4.24.0) 64bit (built Nov 17 2011)) with ESMTPS id
 <0N1P004Q2HRD3GA0@mxout5.netvision.net.il> for 16895@debbugs.gnu.org; Fri, 28 Feb 2014 14:38:01 +0200 (IST)
Received: from skeeve.com (skeeve.com [127.0.0.1]) by skeeve.com (8.14.4/8.14.4/Debian-2ubuntu2.1) with ESMTP id s1SCc0YQ019731; Fri, 28 Feb 2014 14:38:00 +0200
Received: (from arnold@localhost) by skeeve.com (8.14.4/8.14.4/Submit) id s1SCbx5S019730; Fri, 28 Feb 2014 14:37:59 +0200
From: Aharon Robbins
Message-id: <201402281237.s1SCbx5S019730@skeeve.com>
Date: Fri, 28 Feb 2014 14:37:59 +0200
To: eggert@cs.ucla.edu, 16895@debbugs.gnu.org
Subject: Re: bug#16895: [PATCH] grep: fix multiple bugs with bracket expressions
References: <530F7729.4080705@cs.ucla.edu> <201402272031.s1RKVEM3003503@skeeve.com> <530FAD25.50409@cs.ucla.edu>
In-reply-to: <530FAD25.50409@cs.ucla.edu>
User-Agent: Heirloom mailx 12.5 6/20/10
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 16895
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit"
X-Spam-Score: 0.0 (/)

Hi Paul.

> Date: Thu, 27 Feb 2014 13:24:53 -0800
> From: Paul Eggert
> Organization: UCLA Computer Science Department
> To: Aharon Robbins , 16895@debbugs.gnu.org
> Subject: Re: bug#16895: [PATCH] grep: fix multiple bugs with bracket expressions

OK - I tried out that patch (+ the two successors) in gawk and it works
fine, even causing a test that failed to now succeed (since it falls
back to regex). I've merged and pushed the change. I definitely owe
you some beer for this one. :-)

> On 02/27/2014 12:31 PM, Aharon Robbins wrote:
> > What a mouthful! Is all that really necessary?
> >
> You should have seen it before I trimmed it down; it listed every POSIX
> character. I dunno, maybe it could be trimmed, but I was worried about
> oddball character sets like the unibyte JIS character set that's like
> ASCII but substitutes Yen-sign for '\', and a couple of other
> substitutions like that. I figured better safe than sorry. No big deal
> of course.

Is that done at compile time in those locales, or at run time? What
you've put in is a compile time check. I ask out of total ignorance
and am wondering how it works.

> >> >-                    /* build character class. */
> >> >+                    /* Build character class. POSIX allows character
> >> >+                       classes to match multicharacter collating elements,
> >> >+                       but the regex code does not support that, so do not
> >> >+                       worry about that possibility. */
> >
> > I thought GLIBC did support them?
>
> Source code says no. That is, [[:alpha:]] never matches a
> multicharacter collating sequence. [[=a=]] might do so, but [[:alpha:]]
> doesn't. (Unless I'm reading the source code wrong, which is possible.
> It's not documented either way, as far as I know.)

Ah. I misunderstood the context. GLIBC does support [[=a=]] and
[[.ch.]], though, right?

Thanks!

Arnold From debbugs-submit-bounces@debbugs.gnu.org Fri Feb 28 16:21:37 2014 Received: (at 16895) by debbugs.gnu.org; 28 Feb 2014 21:21:37 +0000 Received: from localhost ([127.0.0.1]:45026 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1WJUsD-0005hD-3O for submit@debbugs.gnu.org; Fri, 28 Feb 2014 16:21:37 -0500 Received: from smtp.cs.ucla.edu ([131.179.128.62]:40817) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1WJUsA-0005h2-7z for 16895@debbugs.gnu.org; Fri, 28 Feb 2014 16:21:34 -0500 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id C9A3B39E8011; Fri, 28 Feb 2014 13:21:33 -0800 (PST) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id ceCluOO9Zkwc; Fri, 28 Feb 2014 13:21:33 -0800 (PST) Received: from penguin.cs.ucla.edu (Penguin.CS.UCLA.EDU [131.179.64.200]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 7754E39E8008; Fri, 28 Feb 2014 13:21:33 -0800 (PST) Message-ID: <5310FDDD.7060305@cs.ucla.edu> Date: Fri, 28 Feb 2014 13:21:33 -0800 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.3.0 MIME-Version: 1.0 To: Aharon Robbins , 16895@debbugs.gnu.org Subject: Re: bug#16895: [PATCH] grep: fix multiple bugs with bracket expressions References: <530F7729.4080705@cs.ucla.edu> <201402272031.s1RKVEM3003503@skeeve.com> <530FAD25.50409@cs.ucla.edu> <201402281237.s1SCbx5S019730@skeeve.com> In-Reply-To: <201402281237.s1SCbx5S019730@skeeve.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 16895 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: 
, Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit"
X-Spam-Score: -2.3 (--)

On 02/28/2014 04:37 AM, Aharon Robbins wrote:
> Is that done at compile time in those locales, or at run time?

A bit of both. At compile-time we check that the compile-time character
set is compatible with ASCII (i.e., the C aka POSIX locale). At
run-time we check that the locale is indeed C aka POSIX. Both checks
need to succeed.

> GLIBC does support [[=a=]] and [[.ch.]], though, right?
>
Yes.

From unknown Wed Jun 18 23:13:41 2025
Received: (at fakecontrol) by fakecontrolmessage;
To: internal_control@debbugs.gnu.org
From: Debbugs Internal Request
Subject: Internal Control
Message-Id: bug archived.
Date: Sat, 29 Mar 2014 11:24:09 +0000
User-Agent: Fakemail v42.6.9

# This is a fake control message.
#
# The action:
# bug archived.
thanks
# This fakemail brought to you by your local debbugs
# administrator
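
Note on the thread above: the "bit of both" answer (a compile-time character-set check plus a run-time locale check, both of which must succeed) can be illustrated with a small standalone sketch. This is not the code from the patch. The names native_c_charset_sketch and simple_locale_sketch are made up for illustration, the character list is deliberately shortened, and the caching of the run-time result that the patch performs is left out. It only uses standard C calls (setlocale, strcmp, MB_CUR_MAX).

```c
#include <locale.h>
#include <stdbool.h>
#include <stdlib.h>   /* MB_CUR_MAX */
#include <string.h>

/* Compile-time half (hypothetical, abbreviated): an integer constant
   expression that is nonzero only if the compiler's execution character
   set assigns the ASCII codes to these characters.  The patch tests many
   more characters; this short list is an assumption made for brevity.  */
enum { native_c_charset_sketch =
       ('0' == 48 && '9' == 57 && 'A' == 65 && 'Z' == 90
        && '\\' == 92 && 'a' == 97 && 'z' == 122 && '~' == 126) };

/* Run-time half: return true only if the compile-time test passed, the
   current locale is unibyte, and its name is exactly "C" or "POSIX".  */
static bool
simple_locale_sketch (void)
{
  if (! native_c_charset_sketch || MB_CUR_MAX > 1)
    return false;

  /* Passing NULL queries the current locale without changing it.  */
  const char *locale = setlocale (LC_ALL, NULL);
  return locale && (strcmp (locale, "C") == 0
                    || strcmp (locale, "POSIX") == 0);
}
```

A caller would invoke simple_locale_sketch () after its own setlocale call. On an ASCII host, a program that never leaves the default "C" locale should get true, while one that has switched to a multibyte locale such as a UTF-8 one should get false and would take the slower fallback path discussed in the thread.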