Package: coreutils;
Reported by: "Macdonald, Kim - BCCDC" <kim.macdonald <at> bccdc.ca>
Date: Mon, 23 Nov 2015 21:03:02 UTC
Severity: normal
Tags: notabug
Done: Assaf Gordon <assafgordon <at> gmail.com>
Bug is archived. No further changes may be made.
From: Linda Walsh <coreutils <at> tlinx.org>
To: Bob Proulx <bob <at> proulx.com>
Cc: 22001 <at> debbugs.gnu.org, kim.macdonald <at> bccdc.ca
Subject: bug#22001: Is it possible to tab separate concatenated files?
Date: Thu, 26 Nov 2015 15:52:46 -0800
Bob Proulx wrote:
>
> That example shows a completely different problem.  It shows that your
> input plain text files have no terminating newline, making them
> officially[/sic/] not plain text files but binary files.
> Because every plain
> text line in a file must be terminated with a newline.
----
That's only a recent POSIX definition. It's not related to real life. When I looked for a text file definition on Google, nothing was mentioned about needing a newline on the last line -- except on 1 site -- and that site was clearly not talking about 'text' files, but Unix-text-record files w/each record terminated by a NL char.

On a Mac, txt files have records separated by 'CR', and on DOS/Win, txt files have txt records separated by CRLF. Wikipedia quotes the Unicode definition of txt files -- which doesn't require the POSIX txt-record definition.

Also POSIX limits txt format to 'LINE_MAX' bytes -- notice it says 'bytes' and not characters. Yet a Unicode line of 256 characters can easily exceed 1024 bytes. Yet never in the history of the English language have lines been restricted to some number of bytes or characters.

But one could note that the POSIX definition ONLY refers to files -- not streams of TEXT (whatever the character set). Specifically, note that 'TEXT COLUMNS' describes text measured in column widths -- yet that conflicts with the definition of Text File, in that text files use 'bytes' for a maximum line length, while text columns use 'characters' (which can be 1-4 bytes in Unicode, UTF-8 or UTF-16 encoded).

Of specific note -- "text" composed of characters MUST support 'NUL' (as well as the 'audio bell' (control-g), the backspace (control-h), vertical tab (U+000B), and form-feed (U+000C)). No standard definition outside POSIX includes any of those characters -- because text characters are supposed to be readable and visible.
But POSIX compatibility claims that the Portable Character Set (http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html#tag_06_01) must include those characters.

The 'text-files-must-have-NL' group ignores the POSIX 2008 definition of a portable character set -- but gloms onto the implied definition of a text line as part of a 'text file'. But as already noted, POSIX has conflicting definitions about what text is (Unicode, measured in chars/columns, or ASCII, measured in bytes).

But POSIX 2008 (same URL as above) clearly states:

    A null character, NUL, which has all bits set to zero, shall be in the set of [supported] characters.

In all plain-text definitions, it is mentioned that 'text' is a set of displayable characters that can be broken into lines with the text-line separator definition. The last line of the file Needs No separation character at the end of the line, as it doesn't need to be separated from anything.

The GNU standard should not limit itself to an *arcane* (and not well known outside of POSIX-fans) definition of text, as it makes text files created before 2008 potentially incompatible.

POSIX was supposed to be about portability... it certainly doesn't follow the internet-design meme of "Accept input liberally, and generate output conservatively."

> If they are
> not then it isn't a text line.  Must be binary.
>
---
Whereas I maintain that Newlines are required to break plain-text into records -- but not at the end-of-file, since there is no record following.

> Why isn't there a newline at the end of the file?  Fix that and all of
> your problems and many others go away.
>
---
Didn't used to be a requirement -- it was added because of a broken interpretation of the POSIX standard. Please remember that a posixified definition of 'X' (for any X) may not be the same as a real-life 'X'. In this case, we have a file containing *text* by the POSIX def, which you claim doesn't meet the POSIX definition of "text file".
It's similar to Orwellian-speak -- redefining common terms to mean something else, so people don't notice the requirement change, then later telling others to clean up their old input code/data that doesn't meet the newly created definition.

Text files have been around a lot longer than 8 years. POSIX disqualifies most text files, for example, those created on the most widely used laptop/desktop/commercial computer OS in the world (Windows).

I think what may be true is that 'POSIX text files' describe a data format that may not be how it is stored on disk. I find it very interesting how 'NUL' is defined to be part of any POSIX text character set definition where such apps claim to support or process 'text'.

It's sad to see the GNU utils becoming less flexible and more restricted over time -- much like the trend in computers to steer the public away from general purpose processing (and computers that can do such), to a tightly controlled, walled garden where consumers are only allowed to do what the manufacturer tells them to do.

I suppose it's like the trend in US government that became federal law during the Nixon years -- use of a product inconsistent with its labeling is a violation of federal law. Whereas before, any usage that wasn't prohibited by local law was allowed. It is moving away from a free society with specific restrictions to a controlled society with specific, limited freedoms.
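
The three conditions argued over in this message (a terminating newline on the last line, no NUL inside a line, and a line-length limit counted in bytes rather than characters) can be tried out with a rough Python sketch. The function name `posix_text_problems` and the `line_max` parameter are hypothetical, introduced only for illustration. The default of 2048 is an assumption: it is the smallest LINE_MAX that POSIX permits, and the real limit is system-specific.

```python
def posix_text_problems(data: bytes, line_max: int = 2048) -> list:
    """List the ways `data` falls outside the POSIX 'text file' definition.

    line_max is an assumed limit for illustration; the actual LINE_MAX
    depends on the system.
    """
    problems = []
    if b"\x00" in data:
        problems.append("contains NUL")
    if data and not data.endswith(b"\n"):
        problems.append("last line has no terminating newline")
    # Drop the final piece: it is either empty (the data ended with a newline)
    # or an unterminated chunk, which is already reported above.
    terminated_lines = data.split(b"\n")[:-1]
    # The limit counts bytes, including the newline, not characters.
    if any(len(line) + 1 > line_max for line in terminated_lines):
        problems.append(f"a line exceeds {line_max} bytes")
    return problems


# 256 characters that each need 4 bytes in UTF-8, plus the newline.
line = "\U0001F600" * 256
encoded = (line + "\n").encode("utf-8")
print(len(line), len(encoded))

print(posix_text_problems(b"first\nsecond"))         # no final newline
print(posix_text_problems(encoded, line_max=1024))   # byte limit exceeded
```

Passing 1024 as `line_max` mirrors the figure used in the message; with that limit, a 256-character line of 4-byte characters goes over by the newline alone.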