GNU bug report logs - #68726
GNU grep and sed behaving unexpectedly with multiple 1-or-0 RE capture groups and backreferences


Package: sed;

Reported by: Ed Morton <mortoneccc <at> comcast.net>

Date: Fri, 26 Jan 2024 04:17:02 UTC

Severity: normal



Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Ed Morton <mortoneccc <at> comcast.net>
To: bug-sed <at> gnu.org, bug-grep <at> gnu.org
Subject: GNU grep and sed behaving unexpectedly with multiple 1-or-0 RE
 capture groups and backreferences
Date: Thu, 25 Jan 2024 10:46:34 -0600
[Message part 1 (text/plain, inline)]
There are issues (mostly common but some not) using a regexp like this:

   ^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$

with GNU grep and GNU sed, hence my contacting both mailing lists but 
apologies if that was the wrong starting point.

This started out as a question on StackOverflow
(https://stackoverflow.com/questions/77820540/searching-palindromes-with-grep-e-egrep/77861446?noredirect=1#comment137299746_77861446),
but my "answer" and some comments from there are copied below so you
don't have to look anywhere else for a description of the issues.

Given this input file:

   a
   ab
   abba
   abcdef
   abcba
   zufolo
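
For anyone wanting to reproduce this, the following is a minimal sketch, not part of the original StackOverflow answer. It recreates the input file under the name `sample` (the name used in the commands quoted in this message) and runs the regexp with and without the trailing `$`. The variable names `anchored` and `unanchored` are hypothetical, chosen only for illustration, and no expected output is shown because it depends on the grep version under test.

```shell
# Recreate the six-line input file from the report.
printf '%s\n' a ab abba abcdef abcba zufolo > sample

# The two ERE variants under discussion (hypothetical variable names).
# Single quotes keep the backslashes of the backreferences literal.
anchored='^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$'
unanchored='^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1'

# Print the lines each variant selects. "|| true" keeps the loop going
# if grep exits non-zero (no match) or aborts.
for re in "$anchored" "$unanchored"; do
    printf '== %s\n' "$re"
    grep -E -- "$re" sample || true
done
```

Since the unanchored regexp can match the empty string at the start of any line, its output would be expected to be a superset of the anchored output.
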
Removing the `$` from the end of the regexp (i.e. making it less
restrictive) produces fewer matches, which is the opposite of what it
should do:

a) With the `$` at the end of the regexp:

   $ grep -E '^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$' sample
   a
   abba
   abcba
   zufolo

b) Without the `$` at the end of the regexp:

   $ grep -E '^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1' sample
   a
   abba
   abcba

It's not just GNU grep that behaves strangely. GNU sed shows the same
behavior from the question as GNU `grep` does when just matching with
`sed -nE '/.../p' sample`, AND sed behaves differently if we're just
doing a match vs if we're doing a match + replace. For example, here's
`sed` doing a match+replacement and behaving the same way as `grep`
above:

a) With the `$` at the end of the regexp:

   $ sed -nE 's/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$/&/p' sample
   a
   abba
   abcba
   zufolo

b) Without the `$` at the end of the regexp:

   $ sed -nE 's/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1/&/p' sample
   a
   abba
   abcba

but here's sed just doing a match and behaving differently from any of
the above:

a) With the `$` at the end of the regexp (note the extra `ab` in the
output):

   $ sed -nE '/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$/p' sample
   a
   ab
   abba
   abcba
   zufolo

b) Without the `$` at the end of the regexp (note the extra `ab` and
`abcdef` in the output):

   $ sed -nE '/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1/p' sample
   a
   ab
   abba
   abcdef
   abcba
   zufolo

Also, interestingly, this:

   $ sed -nE 's/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$/<&>/p' sample

outputs:

   <a>
   <abba>
   <abcba>
   <>zufolo

the last line of which means the regexp is apparently matching the
start of the line and ignoring the `$` end-of-string metachar present
in the regexp!

The odd behavior isn't just associated with using `-E`, though. If I
remove `-E` and just use POSIX compliant BREs
(https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03)
then:

a) With the `$` at the end of the regexp:

   $ grep '^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1$' sample
   a
   abba
   abcba
   zufolo

   $ sed -n 's/^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1$/&/p' sample
   a
   abba
   abcba
   zufolo

b) Without the `$` at the end of the regexp:

   $ grep '^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1' sample
   a
   abba
   abcba

   $ sed -n 's/^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1/&/p' sample
   a
   abba
   abcba

and again, just doing a match in sed below behaves differently from the
sed match+replacements above:

a) With the `$` at the end of the regexp:

   $ sed -n '/^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1$/p' sample
   a
   ab
   abba
   abcba
   zufolo

b) Without the `$` at the end of the regexp:

   $ sed -n '/^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1/p' sample
   a
   ab
   abba
   abcdef
   abcba
   zufolo

The above shows that, given the same regexp, sed is apparently matching
different strings depending on whether it's doing a substitution or
not.

These are the versions I was using when testing above:

   $ grep --version | head -1
   grep (GNU grep) 3.11
   $ sed --version | head -1
   sed (GNU sed) 4.9

It was later pointed out that grep in git-bash produces an error
message and core dumps given the original regexp above, e.g.
`grep -E '^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1' sample` and
`grep -E '^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$' sample` both output:

   a
   assertion "num >= 0" failed: file "regexec.c", line 1394, function: pop_fail_stack
   Aborted (core dumped)

Sorry, I can't copy the core off that machine for corporate reasons.
Those git-bash tests were using:

   $ echo $BASH_VERSION
   5.2.15(1)-release
   $ grep --version
   grep (GNU grep) 3.0

Regards,
Ed Morton

This bug report was last modified 1 year and 140 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.