Package: coreutils
Reported by: John Kendall <john <at> capps.com>
Date: Mon, 1 Dec 2014 16:44:01 UTC
Severity: normal
Tags: notabug
Done: Eric Blake <eblake <at> redhat.com>
Bug is archived. No further changes may be made.
Message #33 received at 19240 <at> debbugs.gnu.org:
From: John Kendall <john <at> capps.com>
To: "19240 <at> debbugs.gnu.org" <19240 <at> debbugs.gnu.org>
Cc: Bob Proulx <bob <at> proulx.com>
Subject: Re: bug#19240: cut 8.22 adds newline
Date: Thu, 4 Dec 2014 18:41:48 +0000

Bob Proulx wrote:
> Eric Blake wrote:
>> I'll leave it to other contributors to weigh in on whether omitting
>> the final newline on output when it was missing on input is worth
>> the complexity of a change.
>
>> Pádraig Brady wrote:
>>> If we were just implementing now, I'd not output the extra '\n',
>>> but changing at this stage needs to be carefully considered,
>>> and with all the textutils, not just cut(1).
>>
>> I tend to go the opposite - producing text output, even on non-text
>> input, is more likely to be useful when piping files to other utilities
>> that don't handle non-text files as gracefully as the coreutils. But I
>> definitely agree that it is not something we change lightly.
>
> I have these thoughts and comments to make.
>
> 1. I don't "like" input file lines that don't have trailing newlines.
> It raises the question of whether the input is actually valid input.
> It feels to me like any line missing a newline is incomplete. There
> is likely to have been an error in the creation of it. Handling it
> silently feels like ignoring the error. But raising an actual error
> by exit code or by emitting a warning or error message feels too heavy
> handed. I would lean toward assuming that any incomplete input line
> is actually terminated by a newline as the lesser of the evils.
>
> 2. The suggestion for handling *fields* that do not end with a
> trailing newline differently from those that do doesn't make any sense
> to me at all. What is a field? Is the newline part of the field? I
> think not. Consider this.
>
>   $ printf "one two" | awk '{print$1}'
>   one
>
>   $ printf "one two" | awk '{print$2}'
>   two
>
>   $ printf "one two\n" | awk '{print$1}'
>   one
>
>   $ printf "one two\n" | awk '{print$2}'
>   two
>
> The newline is not part of field two. Otherwise printing it would
> result in the second having two newlines output.
>
>   $ printf "one two" | cut -d' ' -f1
>   one
>
>   $ printf "one two" | cut -d' ' -f2
>   two
>
>   $ printf "one two\n" | cut -d' ' -f1
>   one
>
>   $ printf "one two\n" | cut -d' ' -f2
>   two
>
> Same thing for cut. The newline is not part of any of the fields.
> The newline terminates the input line. The newline is not associated
> with any of the delimited fields contained in an input line.
>
> For byte or character operations in the utils such as head -c those
> are binary operations and should be interpreted strictly according to
> the bytes. But not for cut -c which is column based.
>
> John Kendall wrote:
>> # Solaris cut
>> $ printf "1\n12\n123\n1234\n12345\n123456" | cut -c1-4
>> 1
>> 12
>> 123
>> 1234
>> 1234
>> 1234$
>
> That is tickling non-portable behavior. I had a friend run some tests
> on HP-UX and IBM AIX and the results there were different from
> Solaris. Seems Solaris is already the unusual case.
>
> When looking, count the "1234" lines carefully. Because HP-UX and
> older AIX don't process the line without a trailing newline at all.
> It is omitted there. Newer AIX appears to handle it like GNU.
>
>   # uname -srm
>   HP-UX B.10.20 9000/785
>   # printf "1\n12\n123\n1234\n12345\n123456" | cut -c1-4
>   1
>   12
>   123
>   1234
>   1234
>   #
>
>   # uname -srm
>   HP-UX B.11.31 ia64
>   # printf "1\n12\n123\n1234\n12345\n123456" | cut -c1-4
>   1
>   12
>   123
>   1234
>   1234
>   #
>
>   # uname -s ; oslevel
>   AIX
>   4.3.3.0
>   # printf "1\n12\n123\n1234\n12345\n123456" | cut -c1-4
>   1
>   12
>   123
>   1234
>   1234
>   #
>
>   # uname -s ; oslevel
>   AIX
>   7.1.0.0
>   # printf "1\n12\n123\n1234\n12345\n123456" | cut -c1-4
>   1
>   12
>   123
>   1234
>   1234
>   1234
>   #
>
>   # head -1 /etc/motd ; uname -m
>   Compaq Tru64 UNIX V5.0A
>   alpha
>   # printf "1\n12\n123\n1234\n12345\n123456" | cut -c1-4
>   1
>   12
>   123
>   1234
>   1234
>   #
>
>   # uname -s
>   Darwin
>   # printf "1\n12\n123\n1234\n12345\n123456" | cut -c1-4
>   1
>   12
>   123
>   1234
>   1234
>   1234
>   #
>
> Using input lines without a trailing newline is already a minefield of
> portability problems. It depends upon details of the implementation.
>
> I think what Solaris cut must be doing is processing the emission of
> characters across the line character by character. When it hits the
> input newline it knows it is done and emits a newline itself and
> starts again on a new line. When it hits EOF on the input it probably
> just stops doing anything and exits itself without printing anything
> more and therefore not emitting a newline. Likely just an accident of
> implementation.
>
> This is what makes "lines" without a newline such an unportable thing
> to count upon. It causes it to depend upon an implementation detail.
> Different implementations might do different things. And in fact
> different ones do actually do different things. This probably isn't
> too widespread of an issue or it would have come up more often. And
> more specific to the Solaris code port, there would be similar but
> different problems when trying to use other legacy Unix platforms.
> Best to avoid the construct entirely for robust operation.
>
>> I came upon this while porting scripts from Solaris 10 to CentOS 7.
>
> Can you share with us the specific construct that caused this to
> arise? I have done a lot of script porting to and from HP-UX systems
> and am curious as to the issue.
>

The construct in question is just for formatting the output of a
script that compares disc files to what's in a database.

  echo "$FILE ===========================\c"| cut -c1-30
  echo " matches =========="

The output on Solaris might look something like this (with monospaced
font on a terminal all the "matches" line up):

getDFL_info ================== matches ==========
transWestim_msg ============== matches ==========
selfBillDepotStoHan ========== matches ==========
addSale_invoice ============== matches ==========
buildInvoice ================= matches ==========
addInvoice =================== matches ==========
chgUnit ====================== matches ==========
updSale_invoice ============== matches ==========

The GNU output is:

getDFL_info ==================
 matches ==========
transWestim_msg ==============
 matches ==========
selfBillDepotStoHan ==========
 matches ==========
addSale_invoice ==============
 matches ==========
buildInvoice =================
 matches ==========
addInvoice ===================
 matches ==========
chgUnit ======================
 matches ==========
updSale_invoice ==============
 matches ==========

This can be re-written, of course. (There is one corner case that
Solaris's cut handled nicely for which I have not been able to come up
with a quick fix.)

John

> Bob
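
One possible rewrite of the construct above, offered only as a sketch: let printf do the truncation, so the result does not depend on how a given cut treats input without a trailing newline. The function name format_line and the 30-character pad string are hypothetical and not taken from the original script. The sketch assumes the names are plain ASCII, since the %.30s precision may count bytes rather than characters in some shells. It does not attempt to address the unspecified corner case mentioned above.

```shell
#!/bin/sh
# Hypothetical helper (not from the original script): pad a name with '='
# out to 30 columns and append the status text on the same line.
format_line() {
    # Assumed pad of 30 '=' characters, so even a one-letter name fills 30 columns.
    pad='=============================='
    # %.30s truncates the padded string to 30 bytes and adds no newline of
    # its own; the only newline is the explicit \n at the end of the format.
    printf '%.30s matches ==========\n' "$1 $pad"
}

format_line getDFL_info
format_line transWestim_msg
```

If cut has to stay in the pipeline, appending `| tr -d '\n'` to the cut command is another option. It removes the newline that GNU cut adds and should be harmless on implementations that do not add one.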