Thanks for the reply. This might not be a bug though; I sent a similar mail (https://www.mail-archive.com/austin-group-l@opengroup.org/msg05881.html) to the Austin Group mailing list asking what the expected behavior is in this case. I was told (https://www.mail-archive.com/austin-group-l@opengroup.org/msg05891.html) that both behaviors (yielding n or an empty line) are correct, and that the standard should *probably* be amended to state explicitly that this is unspecified. Apparently (https://www.mail-archive.com/austin-group-l@opengroup.org/msg05893.html) some other UNIXes adopted the same practice as GNU sed (or vice versa, I don't know which one is older).
tags 40242 confirmed
stop
Hello,
On 2020-03-25 11:30 p.m., Oğuz wrote:
While '\t' matches a literal 't' when 't' is the delimiter, '\n' does not
match 'n' when 'n' is the delimiter. See:
$ echo t | sed 'st\ttt' | xxd
00000000: 0a .
$
$ echo n | sed 'sn\nnn' | xxd
00000000: 6e0a n.
Is this a bug or is there a sound logic behind this?
Thank you for finding this interesting edge-case.
I think it is a (very old) bug. I'm not sure about its origin,
perhaps Jim or Paolo can comment.
First, let's start with what's expected (slightly modifying your examples).
In the canonical usage, "\t" becomes a TAB, so "t" is not replaced:
$ printf t | sed 's/\t//' | od -a -An
t
Using a different delimiter, "q" instead of "/", works the same:
$ printf t | sed 'sq\tqq' | od -a -An
t
The sed manual says (in section "3.3 The s command"):
"
The / characters may be uniformly replaced by any other single
character within any given s command.
The / character (or whatever other character is used in its
stead) can appear in the regexp or replacement only if it is
preceded by a \ character.
"
This is the reason "\t" represents a regular "t" (not TAB)
*if* the substitute command's delimiter is "t" as well:
$ printf t | sed 'st\ttt' | od -a -An
[no output, as expected]
And similarly for other characters:
$ printf x | sed 'sx\xxx' | od -a -An
$ printf a | sed 'sa\aaa' | od -a -An
$ printf z | sed 'sz\zzz' | od -a -An
[no output, as expected]
---
Second, the "\n" case behaves differently, regardless of which
separator is used. It is always treated as "\n" (new line),
never as a literal "n", even if the separator is "n".
These are correct, as expected:
$ printf n | sed 's/\n//' | od -a -An
n
$ printf n | sed 'sx\nxx' | od -a -An
n
Here, we'd expect "\n" to be treated as a literal "n" character,
not as a new line, but it is not (as you've found):
$ printf n | sed 'sn\nnn' | od -a -An
n
----
In the code, the "match_slash" function [1] is used to find
the delimiters of the "s" command (typically "slashes").
Special handling happens if a backslash is found [2],
and in lines 557-8 there's this conditional:
else if (ch == 'n' && regex)
ch = '\n';
This forces any "\n" to be a new-line, regardless of whether the
delimiter itself is an "n".
[1] https://git.savannah.gnu.org/cgit/sed.git/tree/sed/compile.c#n531
[2] https://git.savannah.gnu.org/cgit/sed.git/tree/sed/compile.c#n552
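To make the effect of that conditional concrete, here is a minimal sketch of the escape handling as described above. It is a simplified model, not the actual sed source: the function name "read_field" and its signature are hypothetical, and multibyte handling is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of the escape handling described above; NOT the real
   sed source.  `read_field` is a hypothetical name.

   `src` points just past the opening delimiter of an s command.  The field
   up to the closing delimiter is copied into `out`, which must have room
   for strlen(src) + 1 bytes.  Returns the number of bytes consumed
   (including the closing delimiter), or -1 if the field is unterminated.  */
static int
read_field (const char *src, char delim, int regex, char *out)
{
  size_t n = 0;

  assert (src != NULL && out != NULL);
  /* POSIX does not allow backslash or newline as the delimiter.  */
  assert (delim != '\\' && delim != '\n');

  for (size_t i = 0; src[i] != '\0' && src[i] != '\n'; i++)
    {
      char ch = src[i];

      if (ch == delim)
        {
          out[n] = '\0';
          return (int) (i + 1);
        }
      if (ch == '\\')
        {
          ch = src[++i];
          if (ch == '\0')
            break;
          else if (ch == 'n' && regex)
            ch = '\n';          /* the conditional quoted above */
          else if (ch != '\n' && (ch != delim || (!regex && ch == '&')))
            out[n++] = '\\';    /* not an escaped delimiter: keep the backslash */
        }
      out[n++] = ch;
    }
  return -1;
}
```

In this model, the text after the opening delimiter of 'st\ttt' yields the field "t" (the escaped delimiter loses its backslash), whereas the same position in 'sn\nnn' yields a new-line character, because the "n" check is reached before the delimiter check. That matches the outputs shown earlier in this thread.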
In older sed versions, these two lines were protected by
"#ifndef REG_PERL" [3], so perhaps it had something to do with
regex variants. But the origin of this line predates the git history.
Jim/Paolo - any ideas what this relates to?
[3] https://git.savannah.gnu.org/cgit/sed.git/tree/sed/compile.c?id=41a169a9a14b5bdc736313eb411f02bcbe1c046d#n551
---
Interestingly, removing these two lines does not cause
any test failures, so this might be easy to fix without causing
any regressions.
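As a rough illustration of what dropping the "n" conditional would change, the sketch below classifies the character that follows a backslash, with the conditional switched on or off. The names "classify_escape", "esc_result" and the "newline_rule" flag are hypothetical and exist only for this example; it models the decision logic only, not the real code.

```c
#include <assert.h>
#include <stdbool.h>

/* What happens to the character that follows a backslash.  */
enum esc_result
{
  ESC_NEWLINE,          /* stored as a new-line character */
  ESC_BARE,             /* stored as-is, backslash dropped */
  ESC_KEEP_BACKSLASH    /* stored with its backslash, left for later stages */
};

/* Hypothetical classifier; `newline_rule` stands for the presence of the
   `ch == 'n' && regex` conditional discussed in this thread.  */
static enum esc_result
classify_escape (char ch, char delim, bool regex, bool newline_rule)
{
  /* POSIX does not allow backslash or newline as the delimiter.  */
  assert (delim != '\\' && delim != '\n');

  if (newline_rule && ch == 'n' && regex)
    return ESC_NEWLINE;
  if (ch != '\n' && (ch != delim || (!regex && ch == '&')))
    return ESC_KEEP_BACKSLASH;
  return ESC_BARE;
}
```

With newline_rule set to false, "\n" under the delimiter "n" is classified as a bare "n", the same way "\t" is under the delimiter "t". Under "/" the backslash is kept; whether a later stage then turns "\n" into a new-line is outside the scope of this sketch.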
For now I'm leaving this item open until we decide how to deal with it.
regards,
- assaf