Package: sed
Reported by: Stephane Chazelas <stephane.chazelas <at> gmail.com>
Date: Thu, 13 Aug 2015 14:56:01 UTC
Severity: wishlist
Tags: moreinfo, notabug
Message #11 received at 21251 <at> debbugs.gnu.org:
From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: 21251 <at> debbugs.gnu.org
Subject: Re: bug#21251: sed: POSIX and the z command
Date: Sat, 28 Jan 2017 10:01:55 +0000
[Message part 1 (text/plain, inline)]
2017-01-28 01:48:19 +0000, Assaf Gordon:
[...]
> On Thu, Aug 13, 2015 at 03:55:20PM +0100, Stephane Chazelas wrote:
> >[...] The behaviour
> >of sed on non-text input is unspecified, so it doesn't require
> >that . not match a byte that is not part of a valid character.
> >[...]
> >That POSIX requirement is true for regexec() but not for text
> >utilities.
>
> I'm far from familiar with POSIX intricacies, but doesn't that sound a bit
> strange? I would naively think that POSIX would encourage POSIX-compliant
> text utilities to use the system's native regexec implementation, instead of
> supporting slightly different semantics...

Hi Assaf,

It doesn't preclude the use of regexec. It just leaves the
behaviour unspecified when the input is not text, like when
lines are longer than LINE_MAX or when they contain NUL bytes or
when they contain sequences of bytes not forming valid
characters or when there are characters after the last newline
character.

Upon sequences of bytes that don't form valid characters, you're
free to exit with an error, shut down the computer, or whatever
you like, POSIX doesn't care. What POSIX tells the users of the
POSIX API (that is script writers, sed users) is that they can't
expect anything on non-text input.

GNU sed already handles lines longer than LINE_MAX nicely, as
well as lines containing NUL bytes or an unterminated last line.
I'd argue that for sequences of bytes that don't form valid
characters, it would be nicer if "." or "[^anything]" matched
each of the individual bytes. It's what bash's * and ? and
[!anything] fnmatch() patterns do (even though in that case
POSIX seems to forbid it; that has been discussed on the Austin
Group mailing list as well).

> >See that discussion on the Austin Group mailing list:
> >http://thread.gmane.org/gmane.comp.standards.posix.austin.general/11059/focus=11098
>
> This link seems broken. Would you know where to find this discussion online?
[...]

Yes. They relied on gmane for the mailing list archive. The web
interface was discontinued
(https://lars.ingebrigtsen.no/2016/07/28/the-end-of-gmane/),
then taken over by somebody else, but not everything is back:
https://lars.ingebrigtsen.no/2016/09/06/gmane-alive/comment-page-1/

You can still find the discussion using the NNTP interface.

I attach the most relevant message (from Geoff Clare of the
Austin Group). I can send you the whole discussion as a mailbox
file if you like.

-- 
Stephane
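
The four conditions listed in the message above, under which POSIX
leaves the behaviour unspecified, can be checked mechanically. What
follows is only a rough sketch in Python, not anything sed or POSIX
provides: the function name non_text_reasons is made up for
illustration, UTF-8 is assumed as the locale's encoding, and 2048 is
the POSIX minimum for LINE_MAX (a real system may allow more).

```python
# Hypothetical helper, not part of sed or POSIX: lists the reasons a byte
# buffer would not count as "text" in the sense discussed in the message.
LINE_MAX = 2048  # assumption: POSIX minimum (_POSIX2_LINE_MAX)


def non_text_reasons(data: bytes, encoding: str = "utf-8") -> list:
    """Return a list of reasons why `data` is not text (empty if it is)."""
    reasons = []
    if b"\0" in data:
        reasons.append("contains NUL bytes")
    if data and not data.endswith(b"\n"):
        reasons.append("characters after the last newline")
    # Split on the newline byte only; "+ 1" counts the newline itself,
    # which is included in the LINE_MAX limit.
    if any(len(line) + 1 > LINE_MAX for line in data.split(b"\n")):
        reasons.append("line longer than LINE_MAX")
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        reasons.append("bytes not forming valid characters")
    return reasons


if __name__ == "__main__":
    # The ISO 8859-1 file name from the attached message is not valid UTF-8.
    print(non_text_reasons(b"St\xe9phane.txt\n"))
```

An empty list only means that none of these four conditions was
detected; it says nothing about how a particular sed implementation
behaves when one of them does apply.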
[Message part 2 (message/rfc822, inline)]
From: Geoff Clare <gwc <at> opengroup.org>
To: austin-group-l <at> opengroup.org
Subject: Re: UTF-8 and non-characters
Date: Wed, 1 Jul 2015 10:55:14 +0100

Stephane Chazelas <stephane.chazelas <at> gmail.com> wrote, on 30 Jun 2015:
>
> Speaking of which, would a pseudo-UTF-8 locale where bytes that
> don't form valid characters are mapped to a character like
> U+FFFD (�) be POSIX compliant?
>
> Like c3 a9 is é, but c3 41 a9 is �A�
>
> or if not all mapped to a single character, mapped to dedicated
> unassigned code points (0x7fffff80 to 0x7fffffff for instance)?
>
> For instance, above c3 41 a9 being <U+7fffffc3>A<U+7fffffa9>
>
> If allowed, would that not be desirable (I can see it
> potentially being a problem when processing partial input)?

I think this would cause inconsistency between btowc() and the
various multi-byte to wide-character conversion functions.

If btowc(0xc3) returns a wide character, then mbtowc() on c3 a9
ought to convert the c3 to that wide character and return 1,
instead of converting c3 a9 to a wide é and returning 2.

Conversely, if btowc(0xc3) returns WEOF, then mbtowc() on c3 41 a9
ought not to convert the c3 to a wide character.

> A common source of bugs and security vulnerabilities with
> UTF-8 is the fact that not all sequences of bytes map to
> characters and in particular that they're not matched by RE's
> "." or ".*" or fnmatch()'s ? or *.
>
> That's a common problem when you can't guarantee the input is
> valid text for instance for arbitrary file names from the file
> system. That's quite common when dealing with file names that
> were written in a single-byte character set in UTF-8 locales.
>
> For instance,
>
> find . -name '*'
>
> With GNU find at least doesn't match on $'St\xe9phane.txt'
> (Stéphane.txt in the iso8859-1 charset).
>
> An example of a more serious problem:
>
> find . ! -name "* *" -exec cmd-that-would-break-with-spaces {} +

It looks like the pattern matching sections of the standard have
some problems with the use of the terms character and string.
2.13.1 says * matches "multiple characters", but 2.13.2 says it
matches "any string" in item 1 and then says it matches "a string
of zero or more characters" (i.e. any character string) in item 3.

> GNU sed even went as far as defining a new command for emptying
> the pattern space to work around that problem:
>
> `z'
>      This command empties the content of pattern space. It is usually
>      the same as `s/.*//', but is more efficient and works in the
>      presence of invalid multibyte sequences in the input stream.
>      POSIX mandates that such sequences are _not_ matched by `.', so
>      that there is no portable way to clear `sed''s buffers in the
>      middle of the script in most multibyte locales (including UTF-8
>      locales).
>
> Is that claim (about it being a POSIX requirement) true?

I think it's true for regexec(), but not for sed.

(Perhaps we should add a REG_EILSEQ error return for regexec().)

> I'd expect the behaviour to be unspecified if the input is not
> text (as would be the case if there are invalid multi-byte
> sequences).

Exactly.

> See also
> http://unix.stackexchange.com/questions/6516/filtering-invalid-utf8
> where we wondered whether grep -vx '.*' was required to report
> lines with invalid multi-byte sequences.

Unspecified, for the same reason as for sed.

> There was also a discussion earlier here about shells' ? and *
> on invalid byte sequences and most shells seem to match
> individual bytes from invalid multibyte sequences as one
> character (except for yash that won't deal with those at all)
> which seems to me like the safest thing to do.
>
> What's the OpenGroup position on that?

2.13.1 is clear that ? matches a character. The requirements
for * are ambiguous because of the conflicting text I pointed
out above.

-- 
Geoff Clare <g.clare <at> opengroup.org>
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England
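
The byte sequences from the quoted question can be tried out in
Python, whose UTF-8 decoder offers two error handlers that happen to
resemble the two mappings asked about. This is an illustration only:
Python's codecs are not a POSIX locale, the helper names below are
made up, and the "surrogateescape" handler maps each undecodable byte
to U+DC80..U+DCFF rather than to the 0x7fffff80..0x7fffffff range
suggested in the message.

```python
import re

valid = b"\xc3\xa9"        # c3 a9: a well-formed two-byte sequence (é)
invalid = b"\xc3\x41\xa9"  # c3 41 a9: stray lead byte, 'A', stray continuation byte


def decode_replace(data: bytes) -> str:
    """Map every undecodable byte to the single character U+FFFD."""
    return data.decode("utf-8", errors="replace")


def decode_escape(data: bytes) -> str:
    """Map every undecodable byte 0xNN to its own code point U+DCNN."""
    return data.decode("utf-8", errors="surrogateescape")


if __name__ == "__main__":
    # ascii() keeps the output printable whatever the terminal encoding is.
    print(ascii(decode_replace(valid)))    # '\xe9'
    print(ascii(decode_replace(invalid)))  # '\ufffdA\ufffd'
    print(ascii(decode_escape(invalid)))   # '\udcc3A\udca9'

    # Unlike U+FFFD, the per-byte mapping is reversible.
    escaped = decode_escape(invalid)
    assert escaped.encode("utf-8", errors="surrogateescape") == invalid

    # Once each stray byte is a character of its own, "." matches it.
    assert re.fullmatch(".*", escaped) is not None
```

The per-byte mapping preserves the original bytes, whereas the U+FFFD
mapping loses them; neither addresses the btowc()/mbtowc() consistency
concern raised in the reply, which is specific to the C locale
functions.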