#21251 - sed: POSIX and the z command

GNU bug report logs - #21251
sed: POSIX and the z command

Package: sed

Reported by: Stephane Chazelas <stephane.chazelas <at> gmail.com>

Date: Thu, 13 Aug 2015 14:56:01 UTC

Severity: wishlist

Tags: moreinfo, notabug

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: 21251 <at> debbugs.gnu.org
Subject: bug#21251: sed: POSIX and the z command
Date: Tue, 31 Jan 2017 21:49:55 +0000

2017-01-28 21:04:25 +0000, Assaf Gordon:
[...]
> >>I'd expect the behaviour to be unspecified if the input is not
> >>text (as would be the case if there are invalid multi-byte
> >>sequences).
> >
> >Exactly.
>
> So the above somewhat confuses me (as my previous email):
>
> Let's say I was to write a new simple 'sed' for POSIX systems.
> If POSIX/OpenGroup encourages me (as a software writer for posix
> systems) to use the POSIX regexec API, then implicitly my 'sed'
> program wouldn't match invalid multibyte sequences.
> But if OpenGroup wants me to match invalid multibyte sequences in 'sed'.
> it means that in practical terms I shouldn't use POSIX API and
> implement my own regex engine...
[...]

Just to clear up what I think might be the source of the confusion:
this bug is not about GNU sed not being POSIX compliant in this
instance (it is compliant). It is a documentation bug: the claim that
POSIX mandates that s/.*// not empty the pattern space if it contains
invalid characters is wrong. POSIX doesn't mandate that; it mandates
nothing of sed when the input is not text.

The current sed behaviour is compliant. When the input is not text,
*anything* is compliant, as POSIX leaves the behaviour of sed
unspecified then. That's an area not covered by POSIX; you're on your
own. In particular, you're free to ensure that s/.*// empties the
pattern space if you like.

That "simple sed" can do fgets() on a statically allocated buffer of
LINE_MAX length and use POSIX regexec() on it and still be conformant.

Now, though that would be the subject of another "feature request" bug
and, as you say, one that would cover all the text utilities, not just
"sed", I (not POSIX) argue that it would be better if individual bytes
that don't form part of valid characters were treated as characters of
their own rather than pretending they're not there.

That could be done by adding a (non-POSIX) flag to regcomp() and
fnmatch() to enable that behaviour.

Or, like Python does in some cases, work with APIs that operate on
wchar_t* instead of char*, but for the translation from UTF-8 char* to
wchar_t*, use a reserved range for byte values that don't form part of
valid characters. Python uses code points U+DC80 to U+DCFF for bytes
0x80 to 0xff that don't form part of valid characters (U+D800 to
U+DFFF are not characters; they are code points which are otherwise
reserved for UTF-16 encoding).

Without having to change the APIs, another approach (in UTF-8 locales)
could be to preprocess the input to change, for instance, a standalone
0x80 into the would-be UTF-8 encoding of U+DC80 before calling
regexec() (which "." currently matches even though it's not a
character) and do the reverse on output. That would have some
performance impact though.

Note that at the moment there's some discrepancy between GNU tools on
the treatment of the would-be UTF-8 encoding of those D800-DFFF
non-characters (the UTF-16 surrogates). For instance, some treat
"ed b2 80" (the would-be UTF-8 encoding of U+DC80) as 0 characters,
some as 1, some as 3, some as 1 and 3 at the same time:

$ export C=$'\xed\xb2\x80'
$ bash -c '[[ $C = ??? ]]' && echo yes
yes

For bash (and zsh and ksh93), those 3 bytes don't form part of a valid
character, so each is considered a character of its own, which IMO is
the best thing to do.

$ printf %s "$C" | wc -m
0

That's not a character, so we print 0 (as required by POSIX I believe;
wc is _not_ a text utility).

$ touch "$C"; find "$C" -name '*'
$ touch "$C"; find "$C" -name '?'
$ touch "$C"; find "$C" -name '???'
$

That file can't be matched by name!

$ printf '%s\n' "$C" | grep -xl .
(standard input)
$ printf '%s\n' "$C" | sed 's/^.$/yes/'
yes

But:

$ printf '%s\n' "$C" | grep -xPl .
$ printf '%s\n' "$C" | ./grep -Plx '.*'
(standard input)
$ printf '@%s@\n' "$C" | ./grep -Plx '@.*@'
$

Worse: it can be one character and three at the same time:

$ expr "$C" : '^.$'
3
$ printf '%s\n' "$C" | awk '/^.$/ {print length}'
3

(note that's on Linux-Mint 18.1, so not with the latest versions of
those utilities; one would have to check with the latest versions).

(again, that's not a POSIX compliance issue for text utilities).

--
Stephane
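
The byte-to-code-point mapping described in the message (bytes 0x80 to
0xff that are not part of a valid character become U+DC80 to U+DCFF)
is what Python exposes as the "surrogateescape" error handler. The
following is only a rough sketch of the "preprocess, match, reverse on
output" idea, using Python's re module as a stand-in for regexec(). The
helper names decode_lossless, encode_lossless and empty_pattern_space
are hypothetical and chosen for illustration; nothing here reflects how
GNU sed is implemented.

```python
import re


def decode_lossless(raw: bytes) -> str:
    """Decode UTF-8; each undecodable byte 0x80-0xFF becomes U+DC80-U+DCFF."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_lossless(text: str) -> bytes:
    """Reverse of decode_lossless: escaped code points turn back into raw bytes."""
    return text.encode("utf-8", errors="surrogateescape")


def empty_pattern_space(line: bytes) -> bytes:
    """Rough analogue of sed 's/.*//' on one line (no trailing newline).

    Assumption: the line is decoded losslessly first, so '.' sees every
    stray byte as one character of its own instead of failing to match.
    """
    text = decode_lossless(line)
    return encode_lossless(re.sub(r".*", "", text, count=1))


if __name__ == "__main__":
    stray = b"a\x80b"  # 0x80 alone is not valid UTF-8

    # The stray byte survives a decode/encode round trip unchanged.
    assert encode_lossless(decode_lossless(stray)) == stray

    # '.*' now spans the stray byte, so the line is fully emptied.
    assert empty_pattern_space(stray) == b""

    # U+DC80 written out as if it were UTF-8 is the "ed b2 80" from the mail.
    assert "\udc80".encode("utf-8", errors="surrogatepass") == b"\xed\xb2\x80"

    # Python's strict UTF-8 decoder rejects that sequence, so with
    # surrogateescape it is seen as three separately escaped bytes.
    print(len(decode_lossless(b"\xed\xb2\x80")))
```

Decoding to an intermediate string keeps the matching code unaware of
invalid input, at the cost of a conversion on input and on output, which
is the performance impact mentioned in the message.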

This bug report was last modified 6 years and 312 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
