GNU bug report logs - #21251
sed: POSIX and the z command


Package: sed

Reported by: Stephane Chazelas <stephane.chazelas <at> gmail.com>

Date: Thu, 13 Aug 2015 14:56:01 UTC

Severity: wishlist

Tags: moreinfo, notabug


From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: 21251 <at> debbugs.gnu.org
Subject: bug#21251: sed: POSIX and the z command
Date: Tue, 31 Jan 2017 21:49:55 +0000
2017-01-28 21:04:25 +0000, Assaf Gordon:
[...]
> >>I'd expect the behaviour to be unspecified if the input is not
> >>text (as would be the case if there are invalid multi-byte
> >>sequences).
> >
> >Exactly.
> 
> So the above somewhat confuses me (as my previous email):
> 
> Let's say I was to write a new simple 'sed' for POSIX systems.
> If POSIX/OpenGroup encourages me (as a software writer for posix
> systems) to use the POSIX regexec API, then implicitly my 'sed'
> program wouldn't match invalid multibyte sequences.
> But if OpenGroup wants me to match invalid multibyte sequences in 'sed',
> it means that in practical terms I shouldn't use POSIX API and
> implement my own regex engine...
[...]

Just to clear up what I think might be the source of the confusion:
this bug is not about GNU sed failing to comply with POSIX in this
instance (it is compliant), but about a documentation bug. The
claim that POSIX mandates that s/.*// not empty the pattern space
if it contains invalid characters is wrong. POSIX doesn't
mandate that; it mandates nothing of sed when the input is not
text.

The current sed behaviour is compliant. When the input is not
text, *anything* is compliant as POSIX leaves the behaviour of
sed unspecified then. That's an area not covered by POSIX;
you're on your own. In particular, you're free to ensure that
s/.*// empties the pattern space if you like.

That "simple sed" can do fgets() on a statically allocated
buffer of LINE_MAX length and use POSIX regexec() on it and
still be conformant.

Now, though that would be the subject of another "feature
request" bug and, as you say, one that would cover all the text
utilities, not just "sed", I (not POSIX) argue that it would be
better if individual bytes that don't form part of valid
characters were each treated as a character of their own rather
than ignored as if they were not there.

That could be done by adding a (non-POSIX) flag to regcomp() and
fnmatch() to enable that behaviour.

Or, like Python does in some cases, work with APIs that operate
on wchar_t* instead of char*, but for the translation from
char* UTF-8 to wchar_t*, use a reserved range for byte values
that don't form part of valid characters. Python, for instance,
uses code points U+DC80 to U+DCFF for bytes 0x80 to 0xff that
don't form part of valid characters (U+D800 to U+DFFF are not
characters; they are code points which are otherwise reserved
for UTF-16 encoding).
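
To illustrate that Python behaviour, here is a minimal sketch using
the "surrogateescape" error handler of the standard codecs. The
sample input and the variable names are made up for the example; it
only shows the mapping described above, not how sed or regexec()
work.

```python
import re

raw = b"a\x80b"  # 0x80 on its own is not a valid UTF-8 sequence

# surrogateescape maps each undecodable byte 0xNN to code point U+DCNN
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))  # the stray byte shows up as \udc80

# The stray byte now counts as one character, so "." matches it
# and ".*" covers the whole string.
three_chars = re.fullmatch(r"...", text) is not None
emptied = re.sub(r".*", "", text)

# Encoding with the same handler restores the original bytes.
restored = text.encode("utf-8", errors="surrogateescape")
print(three_chars, repr(emptied), restored == raw)
```

With that decoding, the equivalent of s/.*// empties the string even
though the input is not valid text.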

Without having to change the APIs, another approach (in UTF-8
locales) could be to preprocess the input, changing for instance
a standalone 0x80 into the would-be UTF-8 encoding of U+DC80
before calling regexec() (which "." currently matches even
though it's not a character), and to do the reverse on output.
That would have some performance impact though.
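
A rough sketch of that preprocessing idea, written in Python rather
than C for brevity. The function names escape_invalid and
unescape_invalid are hypothetical, and the sketch assumes a UTF-8
locale; it relies on the standard "surrogateescape" and
"surrogatepass" error handlers and says nothing about how a real
sed would integrate this.

```python
def escape_invalid(data: bytes) -> bytes:
    """Rewrite each byte that is not part of a valid UTF-8 character
    as the would-be UTF-8 encoding of U+DC80..U+DCFF."""
    text = data.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogatepass")


def unescape_invalid(data: bytes) -> bytes:
    """Reverse escape_invalid() on output."""
    text = data.decode("utf-8", errors="surrogatepass")
    return text.encode("utf-8", errors="surrogateescape")


line = b"a\x80b"
escaped = escape_invalid(line)
print(escaped.hex(" "))  # 0x80 becomes ed b2 80
print(unescape_invalid(escaped) == line)
```

Valid UTF-8 input passes through unchanged, so only lines that
actually contain stray bytes grow in size.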

Note that at the moment there's some discrepancy between GNU
tools on the treatment of the would-be UTF-8 encoding of those
U+D800 to U+DFFF non-characters (the UTF-16 surrogates).

For instance, some treat "ed b2 80" (the would-be UTF-8 encoding
of U+DC80) as 0 characters, some as 1, some as 3, some as 1 and 3
at the same time:

$ export C=$'\xed\xb2\x80'
$ bash -c '[[ $C = ??? ]]' && echo yes
yes

For bash (and zsh and ksh93), those 3 bytes don't form part of a
valid character, so each is considered a character of its own,
which IMO is the best thing to do.
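
For comparison, a small sketch of how Python's UTF-8 codec treats
the same 3 bytes (this is only an illustration of the "3
characters" interpretation; the variable names are made up):

```python
C = b"\xed\xb2\x80"  # would-be UTF-8 encoding of U+DC80

# The strict decoder rejects encoded surrogates as invalid UTF-8.
try:
    C.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With surrogateescape, each of the 3 bytes becomes its own character.
chars = C.decode("utf-8", errors="surrogateescape")
print(strict_ok, len(chars))
```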

$ printf %s "$C" | wc -m
0

That's not a character, so we print 0 (as required by POSIX I
believe, wc is _not_ a text utility).

$ touch "$C"; find "$C" -name '*'
$ touch "$C"; find "$C" -name '?'
$ touch "$C"; find "$C" -name '???'
$

That file can't be matched by name!

$ printf '%s\n' "$C" | grep -xl .
(standard input)
$ printf '%s\n' "$C" | sed 's/^.$/yes/'
yes

But:

$ printf '%s\n' "$C" | grep -xPl .
$ printf '%s\n' "$C" | ./grep -Plx '.*'
(standard input)
$ printf '@%s@\n' "$C" | ./grep -Plx '@.*@'
$


Worse: it can be one character and three at the same time:

$ expr "$C" : '^.$'
3
$ printf '%s\n' "$C" | awk '/^.$/ {print length}'
3


(Note that this is on Linux Mint 18.1, so not with the latest
versions of those utilities; one would have to check with the
latest versions.)

(Again, that's not a POSIX compliance issue for text utilities.)

-- 
Stephane




This bug report was last modified 6 years and 312 days ago.
