Package: sed
Reported by: Eric Blake <eblake <at> redhat.com>
Date: Mon, 25 Apr 2022 16:07:01 UTC
Severity: normal
Message #8 received at submit <at> debbugs.gnu.org:

From: Christoph Anton Mitterer <calestyo <at> scientia.org>
To: Eric Blake <eblake <at> redhat.com>
Cc: Geoff Clare <gwc <at> opengroup.org>, bug-sed <at> gnu.org, austin-group-l <at> opengroup.org
Subject: Re: [Issue 8 drafts 0001556]: clarify meaning of \n used in a bracket expression in a sed context address or s-command
Date: Tue, 26 Apr 2022 01:44:21 +0200

Hey Eric.

On Mon, 2022-04-25 at 11:06 -0500, Eric Blake wrote:
> The GNU sed developers can be reached at bug-sed <at> gnu.org (per the
> output of 'sed --help', and as done in this email).

Ah, I think I had written to sed-devel in January.

> So if I'm restating your complaint correctly

"complaint" is a bit harsh ;-) ... it's not my intention to step on anyone's toes, I'm just hoping to help with portability.

> you are worried that GNU
> sed's non-POSIX behavior (what you get by default when POSIXLY_CORRECT
> is not set)

Speaking of POSIXLY_CORRECT ... I'm not sure how much that really helps in practice.

First, the reality probably is that most users won't read the info page from top to bottom, and even if they do, it's not certain that they really understand the implications of e.g. '[\n]' and that they'd need to use POSIXLY_CORRECT. Sure, you can argue that this is then the fault of the user, but I don't think that helps in practice.

Second, (sed) scripts may flow in both directions, i.e. from an implementation that is (per default) POSIXly correct to GNU sed (which per default is not) - and vice versa. So when such a script comes from a non-GNU sed and uses '[\n]' in the strict POSIX sense, it would likely just be used as is with GNU sed. That it has different semantics is possibly not immediately visible, as there's no error or the like, and thus people probably won't realise that they'd need to set POSIXLY_CORRECT non-empty for such "foreign" scripts.

And the same would likely happen in the other direction. The average GNU sed user may never notice that '[\n]' being newline is a GNU speciality unless he knows the standard well. If that is then used on a non-GNU sed, the semantics change again.
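
Since no error signals the difference, one way to see which reading a given sed applies to '[\n]' is a small probe. What follows is only a sketch: the function name probe_sed_bracket is made up for illustration and is not part of the thread or of sed, and no particular result is assumed for any implementation.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): report how a sed binary reads
# the bracket expression [\n] in a context address.
probe_sed_bracket() {
    sed_bin=${1:-sed}
    # Case 1: N joins the two input lines into "a<newline>b". This can only
    # match if the bracket expression accepts a newline.
    if printf 'a\nb\n' | "$sed_bin" -n 'N;/a[\n]b/p' | grep . >/dev/null; then
        echo newline
    # Case 2: the input is a, backslash, b. This can only match if the
    # bracket expression accepts a literal backslash.
    elif printf 'a\\b\n' | "$sed_bin" -n '/a[\n]b/p' | grep . >/dev/null; then
        echo backslash-or-n
    else
        echo unknown
    fi
}

probe_sed_bracket sed
```

Running it once per sed implementation that a script is expected to meet (passing the binary's name as the argument) shows whether that script would silently change meaning.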

> treats the four-byte sequence '[\n]' in an s-command regex
> as a bracket expression for the single character of a literal newline
> (that is, interpreting \n as an escape sequence even though it is
> inside a bracket expression), instead of as a bracket expression for
> either of a literal backslash or literal n; but concur that its
> behavior when being POSIX-compliant matches the POSIX rules.

I guess it's at least quite unfortunate that it does so. Especially because GNU seems to really do this only with sed, e.g. grep (with POSIXLY_CORRECT UNset) seems to interpret '[\n]' POSIXly correct...

$ printf 'a\nb' | grep -z '^a[\n]b$' ; echo

$ printf 'a\\b' | grep -z '^a[\n]b$' ; echo
a\b
$ printf 'anb' | grep -z '^a[\n]b$' ; echo
anb

... which I'd blindly guess is also not necessarily clear to the average GNU grep/sed user.

> POSIX can't control what GNU sed does when in non-POSIX mode.

Sure... and even if it would do so in POSIX mode, there's no POSIX police ;-)

Nevertheless... in practice most people will just assume that the default mode is mostly POSIX compliant, except perhaps for "graceful" extensions. All these GNU extensions (like '\+' and friends for BREs... or '\s' and friends for BREs and EREs) still work nicely with POSIX, because POSIX says that these produce undefined results, so if someone really wanted to be portable, he didn't use them.

But this is different for the sed + '[\n]' case. Someone who restricted himself to just POSIX would still get into trouble.

And sure, strictly speaking you're of course right, and only with POSIXLY_CORRECT non-empty is GNU sed guaranteed to behave so - but again, I'd blindly guess that in practice that goes quite easily unnoticed.

> But it
> can document a recommendation to spell the bracket expression intended
> to match either a backslash or an n in the order [n\] to avoid any
> potential confusion with [\n] being interpreted as an escape
> sequence.

The problem remains of course for any scripts which are written & tested with sed implementations that behave the other way and which are then used with GNU sed.

The best (for portability) would probably be if GNU sed could change the behaviour, but I see of course that unfortunately this is likely not easily possible either.

I just searched the sed info page... and that seems to basically say:
> '[LIST]'
> '[^LIST]'
>      Matches any single character in LIST: for example, '[aeiou]'
>      matches all vowels.  A list may include sequences like
>      'CHAR1-CHAR2', which matches any character between (inclusive)
>      CHAR1 and CHAR2.  *Note Character Classes and Bracket
>      Expressions::.

... a bit further down ...
> '\n'
>      Matches the newline character.

IMO, that's however "outside" of the part for bracket expressions, because everything else that is described on the same level (like '\+' or '\DIGIT') is clearly *not* intended to work inside GNU sed bracket expressions, right?

However later in "5.5 Character Classes and Bracket Expressions":
> Also, when not in 'POSIXLY_CORRECT' mode, special escapes like '\n'
> and '\t' are recognized within LIST.  *Note Escapes::.

So I guess at this point it's game over and GNU sed could never really change behaviour without breaking a gazillion things.

btw: I'd hope that these \<char> escape sequences at least all produce the literal <char> when <char> is also the delimiter.

> Or am I missing something else that you are proposing that either the
> Austin Group should do in its documentation efforts, and/or which GNU
> sed should do to comply with the recent Austin Group recommendations?

Well, I guess that GNU sed having explicitly documented this behaviour (for the non-'POSIXLY_CORRECT' mode) means that nothing can be done other than documenting it as well as possible (on both sides).
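
The mode dependence described in the info page could be checked locally along these lines. This is a rough sketch that assumes nothing about the outcome: newline_matches is a hypothetical helper name, and the script only reports what the installed sed does with and without POSIXLY_CORRECT in its environment.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if sed's /a[\n]b/ matches the pattern
# space "a<newline>b" that N builds from the two input lines.
newline_matches() {
    printf 'a\nb\n' | sed -n 'N;/a[\n]b/p' | grep . >/dev/null
}

# Each check runs in a command-substitution subshell, so the change to
# POSIXLY_CORRECT does not leak into the rest of the script.
default=$(unset POSIXLY_CORRECT; newline_matches && echo yes || echo no)
posix=$(POSIXLY_CORRECT=1; export POSIXLY_CORRECT; newline_matches && echo yes || echo no)

printf 'POSIXLY_CORRECT unset: newline matched: %s\n' "$default"
printf 'POSIXLY_CORRECT=1:     newline matched: %s\n' "$posix"
```

If the two lines differ, a script using '[\n]' depends on the environment it is run in, which is the portability trap discussed above.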

Perhaps it's better to use '\\' for any literally meant <backslash> than to just put it at the end of the list, because some implementations could also think about giving special meaning to '\]'.

Really unfortunate though, especially that it's then not even consistent across GNU (i.e. also in GNU sed).

Thanks,
Chris.