GNU bug report logs - #21251
sed: POSIX and the z command

Package: sed

Reported by: Stephane Chazelas <stephane.chazelas <at> gmail.com>

Date: Thu, 13 Aug 2015 14:56:01 UTC

Severity: wishlist

Tags: moreinfo, notabug


Message #14 received at 21251 <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Stephane Chazelas <stephane.chazelas <at> gmail.com>
Cc: 21251 <at> debbugs.gnu.org
Subject: Re: bug#21251: sed: POSIX and the z command
Date: Sat, 28 Jan 2017 21:04:25 +0000

Hello Stephane,

On Sat, Jan 28, 2017 at 10:01:55AM +0000, Stephane Chazelas wrote:
>It doesn't preclude the use of regexec. It just leaves the
>behaviour unspecified when the input is not text

Thanks for the clarification.

>I'd argue that for sequences of bytes that don't form valid
>characters, it would be nicer if "." or "[^anything]" matched
>each of the individual bytes.

Concretely, GNU sed uses several regex engines now (gnulib's dfa for
fast matching, then either glibc's or gnulib's RE for general matching 
and substitution).

To support this behaviour we'll need to ensure all of them behave in
the same reproducible and reliable manner (not impossible, just a TODO).
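
To illustrate the per-byte matching suggested in the quoted text, here
is a minimal sketch in Python rather than in sed's C code. It is not how
GNU sed or any of its regex engines work; it only models the idea by
decoding with Python's `surrogateescape` error handler, so that every
byte that is not part of a valid UTF-8 character becomes one stand-alone
code point that "." can match. The helper names are hypothetical, and
UTF-8 is assumed as the locale encoding.

```python
import re


def decode_bytewise(data: bytes) -> str:
    """Decode UTF-8; each undecodable byte 0xNN becomes the lone surrogate U+DCNN."""
    return data.decode("utf-8", errors="surrogateescape")


def encode_bytewise(text: str) -> bytes:
    """Inverse of decode_bytewise: lone surrogates turn back into the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


def count_dot_matches(data: bytes) -> int:
    """Count how many times "." matches: once per valid character or stray byte."""
    # DOTALL so that "." also matches embedded newlines.
    return len(re.findall(r".", decode_bytewise(data), flags=re.DOTALL))


def delete_all(data: bytes) -> bytes:
    """Model of s/.*// under per-byte matching: the whole input is consumed."""
    text = decode_bytewise(data)
    return encode_bytewise(re.sub(r".*", "", text, count=1, flags=re.DOTALL))
```

With this model, `count_dot_matches(b"a\xc3\xa9\xffb")` counts four
matches (two ASCII letters, one two-byte character and one stray byte),
and `delete_all` always returns an empty result.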

>You can still find the discussion using the NNTP interface. I
>attach the most relevant message (from Geoff Clare of the Austin
>group). I can send you the whole discussion as a mailbox file if
>you like.

I would appreciate it if you could send it to me; I'm interested
in multibyte processing for other GNU programs as well.


>From: Geoff Clare <gwc <at> opengroup.org>
>> GNU sed even went as far as defining a new command for emptying
>> the pattern space to work around that problem:
>> [...]
>> Is that claim (about it being a POSIX requirement) true?
>
>I think it's true for regexec(), but not for sed.
>
>(Perhaps we should add a REG_EILSEQ error return for regexec().)
>
>> I'd expect the behaviour to be unspecified if the input is not
>> text (as would be the case if there are invalid multi-byte
>> sequences).
>
>Exactly.

So the above still somewhat confuses me (as in my previous email):

Let's say I were to write a new, simple 'sed' for POSIX systems.
If POSIX/OpenGroup encourages me (as a software writer for POSIX
systems) to use the POSIX regexec API, then implicitly my 'sed'
program wouldn't match invalid multibyte sequences.
But if OpenGroup wants me to match invalid multibyte sequences in 'sed',
it means that in practical terms I shouldn't use the POSIX API and
should implement my own regex engine instead...
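
The difference between the two behaviours can be modelled in a few
lines. The sketch below is an illustration in Python under stated
assumptions, not GNU sed's implementation: `strict_delete_all` models
`s/.*//` when "." matches only validly encoded characters, so the match
stops at the first invalid byte, while `zap` models a command that
empties the pattern space unconditionally, as `z` does. The function
names are hypothetical, and UTF-8 is assumed as the locale encoding.

```python
import re


def longest_valid_prefix(data: bytes) -> int:
    """Return the length in bytes of the longest prefix that is valid UTF-8."""
    try:
        data.decode("utf-8")
        return len(data)
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that failed to decode.
        return err.start


def strict_delete_all(pattern_space: bytes) -> bytes:
    """Model of s/.*// where "." never matches an invalid byte."""
    valid = pattern_space[:longest_valid_prefix(pattern_space)].decode("utf-8")
    # DOTALL so that "." also matches embedded newlines.
    matched = re.match(r".*", valid, flags=re.DOTALL).group(0)
    consumed = len(matched.encode("utf-8"))
    # Only the matched prefix is removed; the invalid byte and the rest remain.
    return pattern_space[consumed:]


def zap(pattern_space: bytes) -> bytes:
    """Model of an unconditional "empty the pattern space" command."""
    return b""
```

For the input `b"ab\xffcd"`, the strict model leaves `b"\xffcd"` behind,
whereas `zap` returns an empty pattern space regardless of content.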

You compared it with LINE_MAX, but realistically, implementing support
for lines longer than LINE_MAX is a very different scale of effort from
implementing a new regex engine...

What am I missing?

Thanks!
- assaf






This bug report was last modified 6 years and 313 days ago.
