#21251 - sed: POSIX and the z command

GNU bug report logs - #21251
sed: POSIX and the z command

Package: sed;

Reported by: Stephane Chazelas <stephane.chazelas <at> gmail.com>

Date: Thu, 13 Aug 2015 14:56:01 UTC

Severity: wishlist

Tags: moreinfo, notabug

Message #14 received at 21251 <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Stephane Chazelas <stephane.chazelas <at> gmail.com>
Cc: 21251 <at> debbugs.gnu.org
Subject: Re: bug#21251: sed: POSIX and the z command
Date: Sat, 28 Jan 2017 21:04:25 +0000

Hello Stephane,

On Sat, Jan 28, 2017 at 10:01:55AM +0000, Stephane Chazelas wrote:
>It doesn't preclude the use of regexec. It just leaves the
>behaviour unspecified when the input is not text

Thanks for the clarification.

>I'd argue that for sequences of bytes that don't form valid
>characters, it would be nicer if "." or "[^anything]" matched
>each of the individual bytes.

Concretely, GNU sed uses several regex engines now (gnulib's dfa for fast matching, then either glibc's or gnulib's RE for general matching and substitution). To support this behaviour we'll need to ensure all of them behave in the same reproducible and reliable manner (not impossible, just a TODO).

>You can still find the discussion using the NNTP interface. I
>attach the most relevant message (from Geoff Clare of the Austin
>group). I can send you the whole discussion as a mailbox file if
>you like.

I would appreciate it if you could send it to me - I'm interested in multibyte processing for other GNU programs as well.

>From: Geoff Clare <gwc <at> opengroup.org>
>> GNU sed even went as far as defining a new command for emptying
>> the pattern space to work around that problem:
>> [...]
>> Is that claim (about it being a POSIX requirement) true?
>
>I think it's true for regexec(), but not for sed.
>
>(Perhaps we should add a REG_EILSEQ error return for regexec().)
>
>> I'd expect the behaviour to be unspecified if the input is not
>> text (as would be the case if there are invalid multi-byte
>> sequences).
>
>Exactly.

So the above somewhat confuses me (as my previous email):

Let's say I was to write a new simple 'sed' for POSIX systems. If POSIX/OpenGroup encourages me (as a software writer for POSIX systems) to use the POSIX regexec API, then implicitly my 'sed' program wouldn't match invalid multibyte sequences. But if OpenGroup wants me to match invalid multibyte sequences in 'sed', it means that in practical terms I shouldn't use the POSIX API and should implement my own regex engine...

You compared it with LINE_MAX, but realistically, implementing support for lines longer than LINE_MAX is a very different scale of effort than implementing a new regex engine...

What am I missing?

Thanks!
 - assaf
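The two behaviours debated in this message can be illustrated with a small sketch. It is not part of the original message and it is neither sed nor regexec() code: it is a Python analogy, assuming a UTF-8 locale, and the function names and the sample input are hypothetical. `clear_charwise` mimics an engine whose "." refuses to match bytes that do not form a valid character, so an `s/.*//`-style substitution stops at the invalid byte. `clear_bytewise` mimics the behaviour Stephane suggests, where "." matches each individual byte.

```python
import re

# Hypothetical input line: 'a', an invalid UTF-8 byte (0xFF), then 'b'.
line = b"a\xffb"


def is_valid_text(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 (i.e. is 'text' in a UTF-8 locale)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def clear_charwise(data: bytes) -> bytes:
    """Analogue of s/.*// when '.' never matches an invalid byte.

    Invalid bytes are smuggled through decoding as lone surrogates
    (U+DC80..U+DCFF) and then excluded from what '.' may match.
    """
    text = data.decode("utf-8", errors="surrogateescape")
    result = re.sub(r"[^\udc80-\udcff\n]*", "", text, count=1)
    return result.encode("utf-8", errors="surrogateescape")


def clear_bytewise(data: bytes) -> bytes:
    """Analogue of s/.*// when '.' matches each individual byte."""
    return re.sub(rb".*", b"", data, count=1)


print(is_valid_text(line))   # False
print(clear_charwise(line))  # b'\xffb'  (the invalid byte and the rest survive)
print(clear_bytewise(line))  # b''       (the whole line is emptied)
```

In this analogy, the first result corresponds to the situation the z command was introduced to work around, and the second to the byte-by-byte matching proposed in the quoted text. How a real sed behaves depends on the regex engines it is built on, as discussed above.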

This bug report was last modified 6 years and 313 days ago.

