GNU bug report logs - #22001
Is it possible to tab separate concatenated files?

Package: coreutils

Reported by: "Macdonald, Kim - BCCDC" <kim.macdonald <at> bccdc.ca>

Date: Mon, 23 Nov 2015 21:03:02 UTC

Severity: normal

Tags: notabug

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

From: Linda Walsh <coreutils <at> tlinx.org>
To: Bob Proulx <bob <at> proulx.com>
Cc: 22001 <at> debbugs.gnu.org, kim.macdonald <at> bccdc.ca
Subject: bug#22001: Is it possible to tab separate concatenated files?
Date: Thu, 26 Nov 2015 15:52:46 -0800



Bob Proulx wrote:
>
> That example shows a completely different problem.  It shows that your
> input plain text files have no terminating newline, making them
> officially[/sic/] not plain text files but binary files.  

> Because every plain
> text line in a file must be terminated with a newline.
----
   That's only a recent POSIX definition.  It's not related to
real life.  When I looked for a text file definition on Google, nothing
was mentioned about needing a newline on the last line -- except on
one site -- and that site was clearly not talking about 'text' files, but
Unix text-record files with each record terminated by an NL char.

   On a Mac, text files have records separated by 'CR', and on DOS/Win,
text files have records separated by CRLF.  Wikipedia quotes the
Unicode definition of text files -- which doesn't require the POSIX
text-record definition.  Also, POSIX limits the text format to 'LINE_MAX'
bytes -- notice it says 'bytes' and not characters.  Yet a Unicode line of
256 characters can easily exceed 1024 bytes.  And never in the history of
the English language have lines been restricted to some number of bytes or
characters.  But one could note that the POSIX definition ONLY refers
to files -- not streams of TEXT (whatever the character set).
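
   A minimal Python sketch of the byte/character gap described above,
assuming UTF-8 encoding.  The helper name `sizes` and the two sample lines
are made up for illustration; nothing here comes from POSIX or coreutils.

```python
# Compare the number of characters (code points) in a line with the
# number of bytes it occupies once encoded as UTF-8.

def sizes(line: str) -> tuple[int, int]:
    """Return (code points, UTF-8 bytes) for one line of text."""
    return len(line), len(line.encode("utf-8"))

# Hypothetical sample lines, both 256 characters long.
ascii_line = "a" * 256            # 1 byte per character in UTF-8
emoji_line = "\U0001F600" * 256   # 4 bytes per character in UTF-8

for name, line in (("ascii", ascii_line), ("emoji", emoji_line)):
    chars, nbytes = sizes(line)
    print(f"{name}: {chars} characters, {nbytes} bytes")
```

   Here 256 four-byte code points come to 1024 bytes, while the same
number of ASCII characters stays at 256 bytes, so a limit stated in bytes
and a limit stated in characters are not interchangeable.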

   Specifically, note that 'TEXT COLUMNS' are described as text
columns measured in column widths -- yet that conflicts with the
definition of Text File, in that text files use 'bytes' for a maximum
line length, while text columns use 'characters' (which can be
1-4 bytes in Unicode, UTF-8 or UTF-16 encoded).

   Of specific note -- "text" composed of characters MUST
support 'NUL', as well as the audio bell (control-g), the
backspace (control-h), vertical tab (U+000B), and form-feed (U+000C).

   No standard definition outside POSIX includes any of those
characters -- because text characters are supposed to be readable
and visible.  But POSIX compatibility claims that the Portable
Character Set
(http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html#tag_06_01)
must include those characters.

   The 'text-files-must-have-NL' group ignores the POSIX 2008
definition of a portable character set -- but gloms onto the implied
definition of a text line as part of a 'text file'.

   But as already noted, POSIX has conflicting definitions about what text
is (Unicode, measured in chars/columns, or ASCII, measured in bytes).  But
POSIX 2008 (same URL as above) clearly states:
"A null character, NUL, which has all bits set to zero, shall be in the
set of [supported] characters."

   In all plain-text definitions, it is mentioned that 'text' is a
set of displayable characters that can be broken into lines with the
text-line separator definition.  The last line of the file needs no
separation character at the end of the line, as it doesn't need to be
separated from anything.

   The GNU standard should not limit itself to an *arcane* (and not well
known outside of POSIX fans) definition of text, as it makes text files
created before 2008 potentially incompatible.

   POSIX was supposed to be about portability... it certainly doesn't
follow the internet-design meme of "Accept input liberally, and generate
output conservatively."

> If they are
> not then it isn't a text line.  Must be binary.
>   
---
   Whereas I maintain that Newlines are required to break plain-text
into records -- but not at the end-of-file, since there is no record
following.
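
   To make the separator-versus-terminator reading concrete, here is a
small Python sketch.  The function names `records` and `concat` are
hypothetical and are not part of any coreutils tool; the sketch only
assumes that the file contents are available as byte strings.

```python
# Treat NL as a record separator: a final NL is optional, and the records
# are the same with or without it.

def records(data: bytes) -> list[bytes]:
    """Split data into records; a trailing line break adds no empty record.

    Note: bytes.splitlines() also treats CR and CRLF as line breaks.
    """
    return data.splitlines()

def concat(chunks: list[bytes]) -> bytes:
    """Concatenate file contents, adding NL only where a chunk lacks one."""
    out = bytearray()
    for chunk in chunks:
        out += chunk
        # An unterminated last line would otherwise fuse with the next file.
        if chunk and not chunk.endswith(b"\n"):
            out += b"\n"
    return bytes(out)

unterminated = b"a\nb"   # last line has no NL
terminated = b"c\n"      # last line ends with NL

print(records(unterminated))
print(unterminated + terminated)           # plain concatenation
print(concat([unterminated, terminated]))  # separator-aware concatenation
```

   Plain concatenation joins "b" and "c" into one line, which is the
symptom in the original report, while the separator-aware version keeps
them as separate records without touching files that already end in NL.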


> Why isn't there a newline at the end of the file?  Fix that and all of
> your problems and many others go away.
>   
---
   Didn't use to be a requirement -- it was added because of a broken
interpretation of the POSIX standard.  Please remember that a posixified
definition of 'X' (for any X) may not be the same as a real-life 'X'.

   In this case, we have a file containing *text* by the POSIX
def, which you claim doesn't meet the POSIX definition of "text file".
   It's similar to Orwellian-speak -- redefining common terms to mean
something else, so people don't notice the requirement change, then later
telling others to clean up their old input code/data that doesn't
meet the newly created definition.  Text files have been around a lot
longer than 8 years.  POSIX disqualifies most text files, for example,
those created on the most widely used laptop/desktop/commercial computer OS
in the world (Windows).

   I think what may be true is that 'POSIX text files' describe a data
format that may not be how it is stored on disk.  I find it very
interesting how 'NUL' is defined to be part of any POSIX text character
set definition where such apps claim to support or process 'text'.

   It's sad to see the GNU utils becoming less flexible and more
restricted over time -- much like the trend in computers to steer
the public away from general purpose processing (and computers that
can do such), to a tightly controlled, walled garden where consumers
are only allowed to do what the manufacturer tells them to do.

   I suppose it's like the trend in US government that became federal law
during the Nixon years -- use of a product inconsistent with its
labeling is a violation of federal law.  Whereas before, any usage that
wasn't prohibited by local law was allowed.  It is moving away
from a free society with specific restrictions to a controlled society
with specific, limited freedoms.

This bug report was last modified 6 years and 213 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.