Thanks for the detailed feedback, Eric.

The POSIX spec. is, unfortunately, vague on this topic:

The definition of a line (which you quote) is complemented with the definition of an incomplete line:

A sequence of one or more non- <newline> characters at the end of the file.

So while the standard is aware of this possibility, and gives it a name suggesting a kind of line from which something is missing, it prescribes precious little behavior with respect to such incomplete lines.

So we have:


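Beyond the "zero or more lines", the only restrictions placed on what constitutes a text file are that the lines contain no NUL characters and that none exceeds {LINE_MAX} bytes in length, including the <newline> character.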

If you interpret the word "lines" in the phrase "zero or more lines" to mean complete lines only (which is reasonable), then indeed any file that ends in an incomplete line is not a text file.
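To make that strict reading concrete, here is a minimal Python sketch of it (an illustration only, not anything sed itself does). The function name is_strict_text_file is made up, and the default of 2048 for LINE_MAX is an assumption: it is the POSIX minimum, whereas the actual limit on a given system is whatever getconf LINE_MAX reports.

```python
# Assumption: 2048 is the POSIX minimum for LINE_MAX; query the real
# value on your system with `getconf LINE_MAX`.
LINE_MAX = 2048


def is_strict_text_file(data: bytes, line_max: int = LINE_MAX) -> bool:
    """Return True if data is a text file under the strict reading:
    zero or more complete (newline-terminated) lines, no NULs, and no
    line longer than line_max bytes including its newline."""
    if b"\0" in data:
        return False
    pieces = data.split(b"\n")
    complete, tail = pieces[:-1], pieces[-1]
    if tail:
        # Bytes after the last newline form an incomplete line.
        return False
    # Each complete line occupies len(line) + 1 bytes with its newline.
    return all(len(line) + 1 <= line_max for line in complete)
```

Under this reading, b"a\nb\n" qualifies as a text file, while b"a\nb" does not, solely because its final newline is missing.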

I really wish the spec. were more explicit about incomplete lines.

> If anything, the only
> change I would make is have 'sed --posix' error out on non-text input,
> to call attention to the user's attempt to feed non-posix-compliant data
> to sed.

That is definitely an option, but perhaps intuitive understanding and historical practice / other implementations could be considered instead:


So, as a compromise, GNU sed --posix could treat files with an incomplete line as text files, as long as the incomplete line contains no NULs and contains at most getconf LINE_MAX - 1 characters.
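As a rough sketch of what that compromise would accept, here is the check written out in Python. It is only an illustration of the rule as I state it above, not a proposal for how sed should implement it. The function name is_lenient_text_file is hypothetical, the default of 2048 is the POSIX minimum for LINE_MAX rather than a value read from the system, and the sketch counts bytes (as the text-file definition does) rather than characters.

```python
# Assumption: 2048 is the POSIX minimum for LINE_MAX; query the real
# value on your system with `getconf LINE_MAX`.
LINE_MAX = 2048


def is_lenient_text_file(data: bytes, line_max: int = LINE_MAX) -> bool:
    """Return True if data is a text file under the proposed compromise:
    complete lines obey the usual limits, and a trailing incomplete line
    is tolerated if it has no NULs and is at most line_max - 1 bytes."""
    if b"\0" in data:
        return False
    pieces = data.split(b"\n")
    complete, tail = pieces[:-1], pieces[-1]
    # Complete lines: content plus newline must fit in line_max bytes.
    if any(len(line) + 1 > line_max for line in complete):
        return False
    # Incomplete last line (possibly empty): leave room for the newline
    # that a complete line would have carried.
    return len(tail) <= line_max - 1
```

With this check, b"a\nb" is accepted just like b"a\nb\n", while an unterminated final line of 2048 bytes is still rejected.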

Maybe the issue at hand is rarely of concern in the real world, but I've stumbled over it on several occasions when writing portable Sed commands (at least portable between Linux and macOS).
This issue and the infamous -i option incompatibility (which probably will never go away) are what get in the way of writing such commands.

Thanks,

Michael






On Apr 20, 2017, at 6:42 AM, Eric Blake <eblake@redhat.com> wrote:

tag 26574 notabug
thanks

On 04/19/2017 08:43 PM, Michael Klement wrote:
$ sed --version
sed (GNU sed) 4.4

The POSIX spec. <http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html> states:
"Whenever the pattern space is written to standard output or a named file, sed shall immediately follow it with a <newline>."

While GNU Sed's default behavior of preserving the trailing-newline status of the input's last line is defensible and can be helpful,
it should exhibit POSIX-compliant behavior when invoked with --posix.

POSIX also requires that input given to sed be text files:

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html
"The input files shall be text files."

And per the definition of text file, ALL input lines must have a
trailing newline in the first place:
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html
"3.403 Text File
A file that contains characters organized into zero or more lines. The
lines do not contain NUL characters and none can exceed {LINE_MAX} bytes
in length, including the <newline> character. Although POSIX.1-2008 does
not distinguish between text files and binary files (see the ISO C
standard), many utilities only produce predictable or meaningful output
when operating on text files. The standard utilities that have such
restrictions always specify "text files" in their STDIN or INPUT FILES
sections."

"3.206 Line
A sequence of zero or more non- <newline> characters plus a terminating
<newline> character."

Input that does NOT end in a trailing newline is NOT a text file, and
therefore is NOT a POSIX-compliant use of sed, and therefore, sed
--posix need not do anything different with it because you are already
outside the bounds of what POSIX requires.

Therefore, I don't think you have a case for changing any behavior, at
least not on the grounds of appealing to POSIX, so I'm marking this as
not a bug, but feel free to continue discussion.  If anything, the only
change I would make is have 'sed --posix' error out on non-text input,
to call attention to the user's attempt to feed non-posix-compliant data
to sed.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org