GNU bug report logs - #21251
sed: POSIX and the z command


Package: sed;

Reported by: Stephane Chazelas <stephane.chazelas <at> gmail.com>

Date: Thu, 13 Aug 2015 14:56:01 UTC

Severity: wishlist

Tags: moreinfo, notabug

To reply to this bug, email your comments to 21251 AT debbugs.gnu.org.

Report forwarded to bug-sed <at> gnu.org:
bug#21251; Package sed. (Thu, 13 Aug 2015 14:56:01 GMT)

Acknowledgement sent to Stephane Chazelas <stephane.chazelas <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-sed <at> gnu.org. (Thu, 13 Aug 2015 14:56:01 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: bug-sed <at> gnu.org
Subject: sed: POSIX and the z command
Date: Thu, 13 Aug 2015 15:55:20 +0100

Last one for today ;)

The GNU sed documentation has:

`z'
     This command empties the content of pattern space.  It is usually
     the same as `s/.*//', but is more efficient and works in the
     presence of invalid multibyte sequences in the input stream.
     POSIX mandates that such sequences are _not_ matched by `.', so
     that there is no portable way to clear `sed''s buffers in the
     middle of the script in most multibyte locales (including UTF-8
     locales).

The part about the POSIX requirement is not true. The behaviour
of sed on non-text input is unspecified, so it doesn't require
that . not match a byte that is not part of a valid character.

GNU sed's (or grep's for that matter) . (or [^[:alnum:]]...)
could just as well match every byte that doesn't otherwise form
part of a valid character (which would be a much better
behaviour IMO) and still be POSIX compliant.
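
To picture the two behaviours, here is a minimal Python sketch. It is only a model, not how GNU sed is implemented: the helper names `clear_strict` and `clear_bytewise` are hypothetical, the locale is assumed to be UTF-8, and Python's `surrogateescape` error handler stands in for "treat each stray byte as a character".

```python
import re


def clear_strict(pattern_space: bytes) -> bytes:
    """Model s/.*// when '.' matches only validly encoded characters."""
    try:
        pattern_space.decode("utf-8")
    except UnicodeDecodeError as err:
        # '.*' stops at the first stray byte, so everything from there survives.
        return pattern_space[err.start:]
    return b""


def clear_bytewise(pattern_space: bytes) -> bytes:
    """Model s/.*// when each stray byte counts as one character."""
    # surrogateescape maps every undecodable byte to a single code point.
    text = pattern_space.decode("utf-8", "surrogateescape")
    cleared = re.sub(r".*", "", text, count=1, flags=re.DOTALL)
    return cleared.encode("utf-8", "surrogateescape")


print(clear_strict(b"a\xe9b"))    # b'\xe9b'
print(clear_bytewise(b"a\xe9b"))  # b''
```

Under the first model the pattern space is not emptied as soon as it holds a stray byte, which is the situation the `z' command was added for. Under the second, `s/.*//' always empties it.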

That POSIX requirement is true for regexec() but not for text
utilities.

See that discussion on the Austin Group mailing list:
http://thread.gmane.org/gmane.comp.standards.posix.austin.general/11059/focus=11098

-- 
Stephane

Information forwarded to bug-sed <at> gnu.org:
bug#21251; Package sed. (Sat, 28 Jan 2017 01:49:02 GMT)

Message #8 received at 21251 <at> debbugs.gnu.org:

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Stephane Chazelas <stephane.chazelas <at> gmail.com>
Cc: 21251 <at> debbugs.gnu.org
Subject: Re: bug#21251: sed: POSIX and the z command
Date: Sat, 28 Jan 2017 01:48:19 +0000

Hello Stephane,

Sorry for the delayed response. I'm triaging old sed bugs.

On Thu, Aug 13, 2015 at 03:55:20PM +0100, Stephane Chazelas wrote:
> [...] The behaviour
> of sed on non-text input is unspecified, so it doesn't require
> that . not match a byte that is not part of a valid character.
> [...]
> That POSIX requirement is true for regexec() but not for text
> utilities.

I'm far from familiar with POSIX intricacies, but doesn't that sound a
bit strange?  I would naively think that POSIX would encourage
POSIX-compliant text utilities to use the system's native regexec
implementation, instead of supporting slightly different semantics...

> See that discussion on the Austin Group mailing list:
> http://thread.gmane.org/gmane.comp.standards.posix.austin.general/11059/focus=11098

This link seems broken. Would you know where to find this discussion 
online ?

thanks,
- assaf

Information forwarded to bug-sed <at> gnu.org:
bug#21251; Package sed. (Sat, 28 Jan 2017 10:03:02 GMT)

Message #11 received at 21251 <at> debbugs.gnu.org:

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: 21251 <at> debbugs.gnu.org
Subject: Re: bug#21251: sed: POSIX and the z command
Date: Sat, 28 Jan 2017 10:01:55 +0000

[Message part 1 (text/plain, inline)]

2017-01-28 01:48:19 +0000, Assaf Gordon:
[...]
> On Thu, Aug 13, 2015 at 03:55:20PM +0100, Stephane Chazelas wrote:
> >[...] The behaviour
> >of sed on non-text input is unspecified, so it doesn't require
> >that . not match a byte that is not part of a valid character.
> >[...]
> >That POSIX requirement is true for regexec() but not for text
> >utilities.
> 
> I'm far from familiar with POSIX intricacies, but doesn't that sound a bit
> strange?  I would naively think that POSIX would encourage POSIX-compliant
> text utilities to use the system's native regexec implementation, instead of
> supporting slightly different semantics...

Hi Assaf,

It doesn't preclude the use of regexec. It just leaves the
behaviour unspecified when the input is not text, like when
lines are longer than LINE_MAX or when they contain NUL bytes or
when they contain sequences of bytes not forming valid
characters or when there are characters after the last newline
character.
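
As a rough illustration of those four conditions, the following Python sketch lists the reasons a byte string would not count as text. It rests on assumptions: the function name `why_not_text` is made up for this example, the locale's encoding is taken to be UTF-8, and the default limit of 2048 is the minimum LINE_MAX that POSIX guarantees (a real system may allow more).

```python
POSIX2_LINE_MAX = 2048  # minimum LINE_MAX guaranteed by POSIX; assumed here


def why_not_text(data: bytes, line_max: int = POSIX2_LINE_MAX) -> list:
    """Return the reasons data is not text (an empty list means it is text)."""
    reasons = []
    if b"\0" in data:
        reasons.append("contains NUL bytes")
    if data and not data.endswith(b"\n"):
        reasons.append("characters after the last newline")
    pieces = data.split(b"\n")
    # Every piece but the last is followed by a newline, which LINE_MAX counts.
    too_long = any(len(p) + 1 > line_max for p in pieces[:-1])
    if too_long or len(pieces[-1]) > line_max:
        reasons.append("line longer than LINE_MAX")
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        reasons.append("bytes that do not form valid characters")
    return reasons


print(why_not_text(b"hello\n"))    # []
print(why_not_text(b"a\xe9b\n"))   # ['bytes that do not form valid characters']
```

For any input where the list is non-empty, POSIX leaves the behaviour of sed unspecified.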

Upon sequences of bytes that don't form valid characters, you're
free to exit with an error, shut down the computer, or whatever
you like, POSIX doesn't care.

What POSIX tells the users of the POSIX API (that is script
writers, sed users) is that they can't expect anything on
non-text input.

GNU sed already handles lines longer than LINE_MAX nicely, as
well as lines containing NUL bytes or an unterminated last line.

I'd argue that for sequences of bytes that don't form valid
characters, it would be nicer if "." or "[^anything]" matched
each of the individual bytes. It's what bash's * and ? and
[!anything] fnmatch() patterns do (even though in that case
POSIX seems to forbid it; that has been discussed on the Austin
Group mailing list as well).

> >See that discussion on the Austin Group mailing list:
> >http://thread.gmane.org/gmane.comp.standards.posix.austin.general/11059/focus=11098
> 
> This link seems broken. Would you know where to find this discussion online
> ?
[...]

Yes. They relied on gmane for the mailing list archive. The web
interface has been discontinued
(https://lars.ingebrigtsen.no/2016/07/28/the-end-of-gmane/),
then taken over by somebody else, but not everything is back.
https://lars.ingebrigtsen.no/2016/09/06/gmane-alive/comment-page-1/

You can still find the discussion using the NNTP interface. I
attach the most relevant message (from Geoff Clare of the Austin
group). I can send you the whole discussion as a mailbox file if
you like.

-- 
Stephane

[Message part 2 (message/rfc822, inline)]

From: Geoff Clare <gwc <at> opengroup.org>
To: austin-group-l <at> opengroup.org
Subject: Re: UTF-8 and non-characters
Date: Wed, 1 Jul 2015 10:55:14 +0100

Stephane Chazelas <stephane.chazelas <at> gmail.com> wrote, on 30 Jun 2015:
>
> Speaking of which, would a pseudo-UTF-8 locale where bytes that
> don't form valid characters are mapped to a character like
> U+FFFD (�) be POSIX compliant?
> 
> Like c3 a9 is é, but c3 41 a9 is �A�
> 
> or if not all mapped to a single character, mapped to dedicated
> unassigned code points (0x7fffff80 to 0x7fffffff for instance)? 
> 
> For instance, above c3 41 a9 being <U+7fffffc3>A<U+7fffffa9>
> 
> If allowed, would that not be desirable (I can see it
> potentially be a problem when processing partial input)?

I think this would cause inconsistency between btowc() and the various
multi-byte to wide-character conversion functions.

If btowc(0xc3) returns a wide character, then mbtowc() on c3 a9 ought
to convert the c3 to that wide character and return 1, instead of
converting c3 a9 to a wide é and returning 2.

Conversely, if btowc(0xc3) returns WEOF, then mbtowc() on c3 41 a9
ought not to convert the c3 to a wide character.

> A common source of bugs and security vulnerabilities with
> UTF-8 is the fact that not all sequences of bytes map to
> characters and in particular that they're not matched by RE's
> "." or ".*" or fnmatch()'s ? or *.
> 
> That's a common problem when you can't guarantee the input is
> valid text for instance for arbitrary file names from the file
> system. That's quite common when dealing with file names that
> were written in a single-byte character set in UTF-8 locales.
> 
> For instance,
> 
> find . -name '*'
> 
> with GNU find at least, doesn't match $'St\xe9phane.txt'
> (Stéphane.txt in the iso8859-1 charset).
> 
> An example of a more serious problem:
> 
> find . ! -name "* *" -exec cmd-that-would-break-with-spaces {} +

It looks like the pattern matching sections of the standard have
some problems with the use of the terms character and string.

2.13.1 says * matches "multiple characters", but 2.13.2 says it
matches "any string" in item 1 and then says it matches "a string
of zero or more characters" (i.e. any character string) in item 3.

> GNU sed even went as far as defining a new command for emptying
> the pattern space to work around that problem:
> 
> `z'
>      This command empties the content of pattern space.  It is usually
>      the same as `s/.*//', but is more efficient and works in the
>      presence of invalid multibyte sequences in the input stream.
>      POSIX mandates that such sequences are _not_ matched by `.', so
>      that there is no portable way to clear `sed''s buffers in the
>      middle of the script in most multibyte locales (including UTF-8
>      locales).
> 
> Is that claim (about it being a POSIX requirement) true?

I think it's true for regexec(), but not for sed.

(Perhaps we should add a REG_EILSEQ error return for regexec().)

> I'd expect the behaviour to be unspecified if the input is not
> text (as would be the case if there are invalid multi-byte
> sequences).

Exactly.

> See also
> http://unix.stackexchange.com/questions/6516/filtering-invalid-utf8
> where we wondered whether grep -vx '.*' was required to report
> lines with invalid multi-byte sequences.

Unspecified, for the same reason as for sed.

> There was also a discussion earlier here about shells' ? and *
> on invalid byte sequences and most shells seem to match
> individual bytes from invalid multibyte sequences as one
> character (except for yash that won't deal with those at all)
> which seems to me like the safest thing to do.
> 
> What's the OpenGroup position on that?

2.13.1 is clear that ? matches a character.

The requirements for * are ambiguous because of the conflicting text
I pointed out above.

-- 
Geoff Clare <g.clare <at> opengroup.org>
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England

Information forwarded to bug-sed <at> gnu.org:
bug#21251; Package sed. (Sat, 28 Jan 2017 21:06:02 GMT)

Message #14 received at 21251 <at> debbugs.gnu.org:

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Stephane Chazelas <stephane.chazelas <at> gmail.com>
Cc: 21251 <at> debbugs.gnu.org
Subject: Re: bug#21251: sed: POSIX and the z command
Date: Sat, 28 Jan 2017 21:04:25 +0000

Hello Stephane,

On Sat, Jan 28, 2017 at 10:01:55AM +0000, Stephane Chazelas wrote:
>It doesn't preclude the use of regexec. It just leaves the
>behaviour unspecified when the input is not text

Thanks for the clarification.

>I'd argue that for sequences of bytes that don't form valid
>characters, it would be nicer if "." or "[^anything]" matched
>each of the individual bytes.

Concretely, GNU sed uses several regex engines now (gnulib's dfa for
fast matching, then either glibc's or gnulib's RE for general matching 
and substitution).

To support this behaviour we'll need to ensure all of them behave in
the same reproducible and reliable manner (not impossible, just a TODO).

>You can still find the discussion using the NNTP interface. I
>attach the most relevant message (from Geoff Clare of the Austin
>group). I can send you the whole discussion as a mailbox file if
>you like.

I would appreciate it if you could send it to me - I'm interested
in multibyte processing for other GNU programs as well.

>From: Geoff Clare <gwc <at> opengroup.org>
>> GNU sed even went as far as defining a new command for emptying
>> the pattern space to work around that problem:
>> [...]
>> Is that claim (about it being a POSIX requirement) true?
>
>I think it's true for regexec(), but not for sed.
>
>(Perhaps we should add a REG_EILSEQ error return for regexec().)
>
>> I'd expect the behaviour to be unspecified if the input is not
>> text (as would be the case if there are invalid multi-byte
>> sequences).
>
>Exactly.

So the above somewhat confuses me (as in my previous email):

Let's say I was to write a new simple 'sed' for POSIX systems.
If POSIX/OpenGroup encourages me (as a software writer for posix
systems) to use the POSIX regexec API, then implicitly my 'sed'
program wouldn't match invalid multibyte sequences.
But if OpenGroup wants me to match invalid multibyte sequences in 'sed',
it means that in practical terms I shouldn't use the POSIX API and
should implement my own regex engine...

You compared it with LINE_MAX, but realistically, implementing support
for lines longer than LINE_MAX is a very different scale of effort than
implementing a new regex engine...

What am I missing ?

Thanks!
- assaf

Added tag(s) notabug and moreinfo. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Sat, 28 Jan 2017 23:15:02 GMT)

Information forwarded to bug-sed <at> gnu.org:
bug#21251; Package sed. (Tue, 31 Jan 2017 21:51:02 GMT)

Message #19 received at 21251 <at> debbugs.gnu.org:

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: 21251 <at> debbugs.gnu.org
Subject: Re: bug#21251: sed: POSIX and the z command
Date: Tue, 31 Jan 2017 21:49:55 +0000

2017-01-28 21:04:25 +0000, Assaf Gordon:
[...]
> >>I'd expect the behaviour to be unspecified if the input is not
> >>text (as would be the case if there are invalid multi-byte
> >>sequences).
> >
> >Exactly.
> 
> So the above somewhat confuses me (as my previous email):
> 
> Let's say I was to write a new simple 'sed' for POSIX systems.
> If POSIX/OpenGroup encourages me (as a software writer for posix
> systems) to use the POSIX regexec API, then implicitly my 'sed'
> program wouldn't match invalid multibyte sequences.
> But if OpenGroup wants me to match invalid multibyte sequences in 'sed',
> it means that in practical terms I shouldn't use the POSIX API and
> should implement my own regex engine...
[...]

Just to clear up what I think might be the source of the
confusion: this bug is not about GNU sed not being POSIX
compliant in this instance (it is compliant). It is about a
documentation bug: the claim that POSIX mandates that s/.*// not
empty the pattern space when it contains invalid characters is
wrong. POSIX doesn't mandate that; it mandates nothing of sed
when the input is not text.

The current sed behaviour is compliant. When the input is not
text, *anything* is compliant as POSIX leaves the behaviour of
sed unspecified then. That's an area not covered by POSIX,
you're on your own. In particular, you're free to ensure that
s/.*// empties the pattern space if you like.

That "simple sed" can do fgets() on a statically allocated
buffer of LINE_MAX length and use POSIX regexec() on it and
still be conformant.

Now, though that would be the subject of another "feature
request" bug and, as you say, one that would cover all the text
utilities, not just "sed", I (not POSIX) argue that it would be
better if individual bytes that don't form part of valid
characters were each treated as a character of their own rather
than pretending they're not there.

That could be done by adding a (non-POSIX) flag to regcomp() and
fnmatch() to enable that behaviour.

Or, as Python does in some cases, work with APIs that operate on
wchar_t* instead of char*, and for the translation from UTF-8
char* to wchar_t*, use a reserved range for byte values that
don't form part of valid characters. Python uses code points
U+DC80 to U+DCFF for bytes 0x80 to 0xff that don't form part of
valid characters (U+D800 to U+DFFF are not characters; they are
code points otherwise reserved for the UTF-16 encoding).
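
In Python this mapping is exposed as the `surrogateescape` error handler of the codecs. A short sketch of what it does with the c3 41 a9 example discussed on the Austin Group list (the variable names are only for illustration):

```python
import re

raw = b"\xc3\x41\xa9"  # c3 and a9 are stray bytes around an ASCII 'A'

text = raw.decode("utf-8", errors="surrogateescape")
# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
print([hex(ord(ch)) for ch in text])  # ['0xdcc3', '0x41', '0xdca9']

# With that mapping, "." matches each stray byte as one character.
assert re.fullmatch(r"...", text) is not None

# Encoding with the same handler restores the original bytes exactly.
assert text.encode("utf-8", errors="surrogateescape") == raw

# A valid sequence is decoded normally, so c3 a9 is still a single é.
assert b"\xc3\xa9".decode("utf-8", errors="surrogateescape") == "\u00e9"
```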

Without having to change the APIs, another approach (in UTF-8
locales) could be to preprocess the input to change, for
instance, a standalone 0x80 into the would-be UTF-8 encoding of
U+DC80 before calling regexec() (whose "." currently matches
that sequence even though it's not a character) and do the
reverse on output. That would have some performance impact
though.
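
A minimal Python sketch of that preprocessing idea, assuming UTF-8 input. The function names `escape_stray_bytes` and `unescape_stray_bytes` are hypothetical, and Python's `surrogateescape` and `surrogatepass` error handlers stand in for the conversion a C implementation would have to do by hand:

```python
def escape_stray_bytes(data: bytes) -> bytes:
    """Rewrite stray bytes as the would-be UTF-8 encoding of U+DC80..U+DCFF."""
    # Stray byte 0xNN -> lone surrogate U+DCNN -> its three-byte encoding.
    return data.decode("utf-8", "surrogateescape").encode("utf-8", "surrogatepass")


def unescape_stray_bytes(data: bytes) -> bytes:
    """Undo escape_stray_bytes() on output."""
    return data.decode("utf-8", "surrogatepass").encode("utf-8", "surrogateescape")


escaped = escape_stray_bytes(b"\x80")
print(escaped.hex(" "))                # ed b2 80
print(unescape_stray_bytes(escaped))   # b'\x80'
```

Input that already contains the bytes ed b2 80 still round-trips in this sketch, because the decoder treats each of those three bytes as stray and escapes them individually.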

Note that at the moment there's some discrepancy between GNU
tools on the treatment of the would-be UTF-8 encoding of those
D800-DFFF non-characters (the UTF-16 surrogates).

For instance, some treat "ed b2 80" (the would-be UTF-8 encoding
of DC80) as 0 characters, some as 1, some as 3, some as 1 and 3
at the same time:

$ export C=$'\xed\xb2\x80'
$ bash -c '[[ $C = ??? ]]' && echo yes
yes

For bash (and zsh and ksh93), those 3 bytes don't form part of a
valid character, so each is treated as a character of its own,
which IMO is the best thing to do.

$ printf %s "$C" | wc -m
0

That's not a character, so wc prints 0 (as required by POSIX, I
believe; wc is _not_ a text utility).

$ touch "$C"; find "$C" -name '*'
$ touch "$C"; find "$C" -name '?'
$ touch "$C"; find "$C" -name '???'
$

That file can't be matched by name!

$ printf '%s\n' "$C" | grep -xl .
(standard input)
$ printf '%s\n' "$C" | sed 's/^.$/yes/'
yes

But:

$ printf '%s\n' "$C" | grep -xPl .
$ printf '%s\n' "$C" | ./grep -Plx '.*'
(standard input)
$ printf '@%s@\n' "$C" | ./grep -Plx '@.*@'
$


Worse: it can be one character and three at the same time:

$ expr "$C" : '^.$'
3
$ printf '%s\n' "$C" | awk '/^.$/ {print length}'
3


(note that's on Linux-Mint 18.1, so not with the latest versions
of those utilities, one would have to check with the latest
versions).

(again, that's not a POSIX compliance issue for text utilities).
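
The three different counts can be reproduced in Python purely as an analogy. None of the GNU tools above is implemented this way; the sketch only shows that each count follows from a different, self-consistent decoding policy, here picked through Python's codec error handlers:

```python
C = b"\xed\xb2\x80"  # would-be UTF-8 encoding of U+DC80

counts = {
    # Drop bytes that do not form valid characters.
    "skip invalid bytes": len(C.decode("utf-8", "ignore")),
    # Accept the encoded surrogate as if it were a character.
    "accept encoded surrogate": len(C.decode("utf-8", "surrogatepass")),
    # Treat each stray byte as a character of its own.
    "one character per stray byte": len(C.decode("utf-8", "surrogateescape")),
}

for policy, n in counts.items():
    print(f"{policy}: {n}")
# skip invalid bytes: 0
# accept encoded surrogate: 1
# one character per stray byte: 3
```

The "1 and 3 at the same time" results correspond to a tool applying one policy while matching and another while counting.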

-- 
Stephane

Severity set to 'wishlist' from 'normal'. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Tue, 09 Oct 2018 11:26:02 GMT)

This bug report was last modified 6 years and 312 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
