From: Ed Morton
To: bug-sed@gnu.org, bug-grep@gnu.org
Date: Thu, 25 Jan 2024 10:46:34 -0600
Subject: GNU grep and sed behaving unexpectedly with multiple 1-or-0 RE capture groups and backreferences

There are issues (mostly common but some not) using a regexp like this:

    ^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$

with GNU grep and GNU sed, hence my contacting both mailing lists, but apologies if that was the wrong starting point.

This started out as a question on StackOverflow (https://stackoverflow.com/questions/77820540/searching-palindromes-with-grep-e-egrep/77861446?noredirect=1#comment137299746_77861446), but my "answer" and some comments from there are copied below so you don't have to look anywhere else for a description of the issues.

Given this input file:

    a
    ab
    abba
    abcdef
    abcba
    zufolo
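
To reproduce the results, the input file can be recreated with a single command. This is only a convenience sketch; the file name `sample` is taken from the commands used throughout this report.

```shell
# Recreate the six-line input file used by the commands in this report.
printf '%s\n' a ab abba abcdef abcba zufolo > sample

# Show the file so its contents can be checked against the listing above.
cat sample
```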

Removing the `$` from the end of the regexp (i.e. making it less restrictive) produces fewer matches, which is the opposite of what it should do:

a) With the `$` at the end of the regexp:

    $ grep -E '^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$' sample
    a
    abba
    abcba
    zufolo

b) Without the `$` at the end of the regexp:

    $ grep -E '^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1' sample
    a
    abba
    abcba
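
For reference, read on its own terms the regexp with the trailing `$` should match exactly the lines that are palindromes of at most 11 characters (five groups, one optional middle character, five backreferences), while without the `$` every line should match, because all the optional groups can match the empty string at the start of the line. The sketch below computes the expected result for the anchored case without using any regexp engine; the function names `is_palindrome` and `expected_matches` are made up for this illustration.

```shell
# Hypothetical helper: succeeds if $1 reads the same forwards and backwards.
is_palindrome() {
    s=$1
    r=
    while [ -n "$s" ]; do
        r="${s%"${s#?}"}$r"   # prepend the first character of s to r
        s=${s#?}              # drop that first character from s
    done
    [ "$r" = "$1" ]
}

# Hypothetical helper: print the stdin lines that the anchored regexp
# should match, i.e. palindromes of at most 11 characters.
expected_matches() {
    while IFS= read -r line; do
        if [ "${#line}" -le 11 ] && is_palindrome "$line"; then
            printf '%s\n' "$line"
        fi
    done
}

printf '%s\n' a ab abba abcdef abcba zufolo | expected_matches
# prints:
# a
# abba
# abcba
```

Under this reading, `zufolo` is the unexpected line in case a), and case b) should have printed all six lines.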

It's not just GNU grep that behaves strangely. GNU sed shows the same behavior described in the question when just matching with `sed -nE '/.../p' sample`, and sed also behaves differently depending on whether it is just doing a match or doing a match plus replacement.

For example here's `sed` doing a match+replacement and behaving the same way as `grep` above:

a) With the `$` at the end of the regexp:

    $ sed -nE 's/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$/&/p' sample
    a
    abba
    abcba
    zufolo

b) Without the `$` at the end of the regexp:

    $ sed -nE 's/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1/&/p' sample
    a
    abba
    abcba

but here's sed just doing a match and behaving differently from any of the above:

a) With the `$` at the end of the regexp (note the extra `ab` in the output):

    $ sed -nE '/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$/p' sample
    a
    ab
    abba
    abcba
    zufolo

b) Without the `$` at the end of the regexp (note the extra `ab` and `abcdef` in the output):

    $ sed -nE '/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1/p' sample
    a
    ab
    abba
    abcdef
    abcba
    zufolo

Also interestingly this:

    $ sed -nE 's/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$/<&>/p' sample

outputs:

    <a>
    <abba>
    <abcba>
    <>zufolo

the last line of which means the regexp is apparently matching an empty string at the start of the line and ignoring the `$` end-of-string metachar present in the regexp!

The odd behavior isn't just associated with using `-E`, though. If I remove `-E` and just use [POSIX compliant BREs](https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03) then:

a) With the `$` at the end of the regexp:

    $ grep '^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1$' sample
    a
    abba
    abcba
    zufolo

    $ sed -n 's/^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1$/&/p' sample
    a
    abba
    abcba
    zufolo

b) Without the `$` at the end of the regexp:

    $ grep '^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1' sample
    a
    abba
    abcba

    $ sed -n 's/^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1/&/p' sample
    a
    abba
    abcba

and again just doing a match in sed below behaves differently from the sed match+replacements above:

a) With the `$` at the end of the regexp:

    $ sed -n '/^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1$/p' sample
    a
    ab
    abba
    abcba
    zufolo

b) Without the `$` at the end of the regexp:

    $ sed -n '/^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1/p' sample
    a
    ab
    abba
    abcdef
    abcba
    zufolo

The above shows that, given the same regexp, sed is apparently matching different strings depending on whether it's doing a substitution or not.
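
The match-only versus match+replace comparison can be automated with a small script. This is a minimal sketch, not part of the original tests: the function name `compare_sed_modes` is hypothetical, it assumes a sed that accepts `-E`, and it assumes the regexp contains no `/` characters.

```shell
# Hypothetical helper: report whether sed's address match (/RE/p) and its
# substitution match (s/RE/&/p) select the same lines of a file.
# Assumes the regexp in $1 contains no '/' characters.
compare_sed_modes() {
    re=$1
    file=$2
    addr=$(sed -nE "/$re/p" "$file")
    subst=$(sed -nE "s/$re/&/p" "$file")
    if [ "$addr" = "$subst" ]; then
        echo same
    else
        echo differ
    fi
}

printf '%s\n' a ab abba abcdef abcba zufolo > sample

# A simple regexp without backreferences, as a sanity check.
compare_sed_modes '^a' sample

# The regexp from this report; the result depends on the sed version.
compare_sed_modes '^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$' sample
```

Since `s/RE/&/` replaces each match with itself, both forms should print the same lines for any regexp, so a `differ` result points at the inconsistency described above.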

These are the versions I was using when testing above:

    $ grep --version | head -1
    grep (GNU grep) 3.11

    $ sed --version | head -1
    sed (GNU sed) 4.9
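
When comparing versions, it might also help to vary the number of optional groups to see where the behavior starts to change. The sketch below is only a suggestion under that assumption: `build_regex` is a hypothetical helper, and the counts printed by the loop depend on the grep version, so none are shown here.

```shell
# Hypothetical helper: print an ERE with N optional single-character groups,
# an optional middle character, and backreferences in reverse order.
# N=5 reproduces the regexp from this report; N must be between 1 and 9.
# The second argument is the trailing anchor: '$' or an empty string.
build_regex() {
    n=$1
    anchor=$2
    groups=
    backrefs=
    i=1
    while [ "$i" -le "$n" ]; do
        groups="$groups(.?)"
        backrefs="\\$i$backrefs"
        i=$((i + 1))
    done
    printf '%s\n' "^$groups.?$backrefs$anchor"
}

printf '%s\n' a ab abba abcdef abcba zufolo > sample

# Count matching lines for 1 to 5 groups, with the trailing '$'.
for n in 1 2 3 4 5; do
    re=$(build_regex "$n" '$')
    printf '%s groups: ' "$n"
    grep -cE "$re" sample || true
done
```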

It was later pointed out that grep in git-bash produces an error message and dumps core given the original regexp above, e.g.

    grep -E '^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1' sample

and

    grep -E '^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$' sample

both output:

    a
    assertion "num >= 0" failed: file "regexec.c", line 1394, function: pop_fail_stack
    Aborted (core dumped)

Sorry, I can't copy the core off that machine for corporate reasons.

Those git-bash tests were using

    $ echo $BASH_VERSION
    5.2.15(1)-release

    $ grep --version
    grep (GNU grep) 3.0

Regards,

	Ed Morton