GNU bug report logs - #22001
Is it possible to tab separate concatenated files?


Package: coreutils;

Reported by: "Macdonald, Kim - BCCDC" <kim.macdonald <at> bccdc.ca>

Date: Mon, 23 Nov 2015 21:03:02 UTC

Severity: normal

Tags: notabug

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 22001 in the body.
You can then email your comments to 22001 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-coreutils <at> gnu.org:
bug#22001; Package coreutils. (Mon, 23 Nov 2015 21:03:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to "Macdonald, Kim - BCCDC" <kim.macdonald <at> bccdc.ca>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Mon, 23 Nov 2015 21:03:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: "Macdonald, Kim - BCCDC" <kim.macdonald <at> bccdc.ca>
To: "'bug-coreutils <at> gnu.org'" <bug-coreutils <at> gnu.org>
Subject: Is it possible to tab separate concatenated files?
Date: Mon, 23 Nov 2015 12:50:12 -0800
[Message part 1 (text/plain, inline)]
Hi!

I'm just looking at the options for the cat command - I see there's a way to ignore tabs when they exist - but is there a way to tab separate the files you're concatenating with the cat command?

Thanks,
Kim




Information forwarded to bug-coreutils <at> gnu.org:
bug#22001; Package coreutils. (Mon, 23 Nov 2015 22:03:01 GMT) Full text and rfc822 format available.

Message #8 received at 22001 <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: "Macdonald, Kim - BCCDC" <kim.macdonald <at> bccdc.ca>, 22001 <at> debbugs.gnu.org
Subject: Re: bug#22001: Is it possible to tab separate concatenated files?
Date: Mon, 23 Nov 2015 17:02:44 -0500
tag 22001 notabug
close 22001
stop

Hello Kim,

On 11/23/2015 03:50 PM, Macdonald, Kim - BCCDC wrote:
> I’m just looking at the options for the cat command – I see there’s a
> way to ignore tabs when they exist – but is there a way to tab
> separate the files you’re concatenating with the cat command?

It is unclear (to me) what you're trying to achieve - could you provide a bit more detail (perhaps a short example)?

If you have a file (one file) with spaces and you wish to convert them to tabs, consider the 'expand' command (then pipe to 'cat' if needed).

If you have multiple files and you wish to print them side-by-side, separated by tabs (as opposed to one-after-the-other, as with 'cat'),
consider using 'paste':

  $ cat 1.txt
  a
  b
  c
  d

  $ cat 2.txt
  1
  2
  3
  4

  $ cat 3.txt
  w
  x
  y
  z

  $ paste 1.txt 2.txt 3.txt
  a	1	w
  b	2	x
  c	3	y
  d	4	z

regards,
 - assaf





Information forwarded to bug-coreutils <at> gnu.org:
bug#22001; Package coreutils. (Mon, 23 Nov 2015 22:46:02 GMT) Full text and rfc822 format available.

Message #11 received at 22001 <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: "Macdonald, Kim - BCCDC" <kim.macdonald <at> bccdc.ca>, 22001 <at> debbugs.gnu.org
Subject: Re: bug#22001: Is it possible to tab separate concatenated files?
Date: Mon, 23 Nov 2015 17:46:31 -0500
Correcting myself:

On 11/23/2015 05:02 PM, Assaf Gordon wrote:
> If you have a file (one file) with spaces and you wish to convert
> them to tabs, consider the 'expand' command (then pipe to 'cat' if
> needed).
>

"unexpand" will convert spaces to tabs,
"expand" will convert tabs to spaces.

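A minimal sketch of the difference between the two commands, assuming the default tab stops of 8 columns; the scratch directory and file names are made up for illustration:

```shell
# Scratch directory so that no real files are touched (hypothetical names).
tmp=$(mktemp -d)

# One line that starts with eight spaces.
printf '        indented\n' > "$tmp/spaces.txt"

# unexpand: leading blanks become tabs (by default only leading blanks
# are converted, with a tab stop every 8 columns).
unexpand "$tmp/spaces.txt" > "$tmp/tabs.txt"

# expand: tabs become spaces again.
expand "$tmp/tabs.txt" > "$tmp/roundtrip.txt"

# od -c shows the tab as \t, which makes the conversion visible.
od -c "$tmp/tabs.txt"
```

After the round trip, roundtrip.txt should be byte-identical to spaces.txt.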




Information forwarded to bug-coreutils <at> gnu.org:
bug#22001; Package coreutils. (Mon, 23 Nov 2015 22:54:02 GMT) Full text and rfc822 format available.

Message #14 received at 22001 <at> debbugs.gnu.org (full text, mbox):

From: "Macdonald, Kim - BCCDC" <kim.macdonald <at> bccdc.ca>
To: 'Assaf Gordon' <assafgordon <at> gmail.com>, "22001 <at> debbugs.gnu.org"
 <22001 <at> debbugs.gnu.org>
Subject: RE: bug#22001: Is it possible to tab separate concatenated files?
Date: Mon, 23 Nov 2015 14:52:52 -0800
Thanks Assaf, 

Sorry for the confusion - I wanted to add a tab (or even a new line) after each file that was concatenated. Actually a new line may be better. 

For Example:
Concatenate the files like so:
>gi|452742846|ref|NZ_CAFD010000001.1| Salmonella enterica subsp., whole genome shotgun sequenceTTTCAGCATATATATAGGCCATCATACATAGCCATATAT
>gi|452742846|ref|NZ_CAFD010000002.1| Salmonella enterica subsp., whole genome shotgun sequenceCATAGCCATATATACTAGCTGACTGACGTCGCAGCTGGTCAGACTGACGTACGTCGACTGACGTC
>gi|452742846|ref|NZ_CAFD010000003.1| Salmonella enterica subsp., whole genome shotgun sequenceTATATAGATACATATATCGCGATATCAGACTGCATAGCGTCAG

Right now, just using cat, they look like:
>gi|452742846|ref|NZ_CAFD010000001.1| Salmonella enterica subsp., whole genome shotgun sequenceTTTCAGCATATATATAGGCCATCATACATAGCCATATAT>gi|452742846|ref|NZ_CAFD010000002.1| Salmonella enterica subsp., whole genome shotgun sequenceCATAGCCATATATACTAGCTGACTGACGTCGCAGCTGGTCAGACTGACGTACGTCGACTGACGTC>gi|452742846|ref|NZ_CAFD010000003.1| Salmonella enterica subsp., whole genome shotgun sequenceTATATAGATACATATATCGCGATATCAGACTGCATAGCGTCAG


Kim








Information forwarded to bug-coreutils <at> gnu.org:
bug#22001; Package coreutils. (Mon, 23 Nov 2015 23:10:01 GMT) Full text and rfc822 format available.

Message #17 received at 22001 <at> debbugs.gnu.org (full text, mbox):

From: Bob Proulx <bob <at> proulx.com>
To: "Macdonald, Kim - BCCDC" <kim.macdonald <at> bccdc.ca>
Cc: 22001 <at> debbugs.gnu.org
Subject: Re: bug#22001: Is it possible to tab separate concatenated files?
Date: Mon, 23 Nov 2015 16:09:32 -0700
Macdonald, Kim - BCCDC wrote:
> Sorry for the confusion - I wanted to add a tab (or even a new line)
> after each file that was concatenated. Actually a new line may be
> better.
>
> For Example:
> Concatenate the files like so:
> >gi|452742846|ref|NZ_CAFD010000001.1| Salmonella enterica subsp., whole genome shotgun sequenceTTTCAGCATATATATAGGCCATCATACATAGCCATATAT
> >gi|452742846|ref|NZ_CAFD010000002.1| Salmonella enterica subsp., whole genome shotgun sequenceCATAGCCATATATACTAGCTGACTGACGTCGCAGCTGGTCAGACTGACGTACGTCGACTGACGTC
> >gi|452742846|ref|NZ_CAFD010000003.1| Salmonella enterica subsp., whole genome shotgun sequenceTATATAGATACATATATCGCGATATCAGACTGCATAGCGTCAG
> 
> Right now - Just using cat, they look , like:
> >gi|452742846|ref|NZ_CAFD010000001.1| Salmonella enterica subsp., whole genome shotgun sequenceTTTCAGCATATATATAGGCCATCATACATAGCCATATAT>gi|452742846|ref|NZ_CAFD010000002.1| Salmonella enterica subsp., whole genome shotgun sequenceCATAGCCATATATACTAGCTGACTGACGTCGCAGCTGGTCAGACTGACGTACGTCGACTGACGTC>gi|452742846|ref|NZ_CAFD010000003.1| Salmonella enterica subsp., whole genome shotgun sequenceTATATAGATACATATATCGCGATATCAGACTGCATAGCGTCAG

That example shows a completely different problem.  It shows that your
input plain text files have no terminating newline, making them
officially not plain text files but binary files.  Because every plain
text line in a file must be terminated with a newline.  If they are
not then it isn't a text line.  Must be binary.

Why isn't there a newline at the end of the file?  Fix that and all of
your problems and many others go away.

Getting ahead of things 1...

If you just can't fix the lack of a newline at the end of those files
then you must handle it explicitly.

  for f in *.txt; do
    cat "$f"
    echo
  done
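The loop above also prints an extra blank line after any file that already ends in a newline. A sketch of a variant that adds the newline only when it is missing; the file names and contents are made up for illustration:

```shell
# Scratch files (hypothetical): two without a final newline, one with.
tmp=$(mktemp -d)
printf 'first'    > "$tmp/1.txt"
printf 'second\n' > "$tmp/2.txt"
printf 'third'    > "$tmp/3.txt"

for f in "$tmp"/*.txt; do
  cat "$f"
  # Command substitution strips a trailing newline, so the result is
  # empty when the last byte is a newline (or the file is empty).
  if [ -n "$(tail -c 1 "$f")" ]; then
    echo
  fi
done > "$tmp/joined.out"

cat "$tmp/joined.out"
```

Each input then contributes whole lines to the output, whether or not it was terminated to begin with.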

Getting ahead of things 2...

Sometimes people just want a separator between files.
Actually 'tail' will already do this rather well.

  tail -n+0 *.txt
  ==> 1.txt <==
  foo

  ==> 2.txt <==
  bar

Bob




Information forwarded to bug-coreutils <at> gnu.org:
bug#22001; Package coreutils. (Mon, 23 Nov 2015 23:48:01 GMT) Full text and rfc822 format available.

Message #20 received at 22001 <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Bob Proulx <bob <at> proulx.com>,
 "Macdonald, Kim - BCCDC" <kim.macdonald <at> bccdc.ca>
Cc: 22001 <at> debbugs.gnu.org
Subject: Re: bug#22001: Is it possible to tab separate concatenated files?
Date: Mon, 23 Nov 2015 18:47:49 -0500
Hello Kim,

On 11/23/2015 06:09 PM, Bob Proulx wrote:
> Macdonald, Kim - BCCDC wrote:
>> For Example:
>> Concatenate the files like so:
>>> gi|452742846|ref|NZ_CAFD010000001.1| Salmonella enterica subsp., whole genome shotgun sequenceTTTCAGCATATATATAGGCCATCATACATAGCCATATAT
>>> gi|452742846|ref|NZ_CAFD010000002.1| Salmonella enterica subsp., whole genome shotgun sequenceCATAGCCATATATACTAGCTGACTGACGTCGCAGCTGGTCAGACTGACGTACGTCGACTGACGTC
>>> gi|452742846|ref|NZ_CAFD010000003.1| Salmonella enterica subsp., whole genome shotgun sequenceTATATAGATACATATATCGCGATATCAGACTGCATAGCGTCAG
>>
> That example shows a completely different problem.  It shows that your
> input plain text files have no terminating newline, making them
> officially not plain text files but binary files.

Based on the content of your files, I'm guessing that you are working with a mangled FASTA file.
In that case, it is possible that fixing the original files might be more efficient than trying to amend them later on.

The original FASTA files likely looked like so:

    >gi|452742846|ref|NZ_CAFD010000001.1| Salmonella enterica subsp., whole genome shotgun sequence
    TTTCAGCATATATATAGGCCATCATACATAGCCATATAT

And I'm also guessing that with some script you've removed the ">" prefix and joined the two lines into one.

First,
I suggest ensuring the original files have unix-style new-lines (LF) and not windows style (CR-LF) or Mac-style (CR).
The programs 'dos2unix' and 'mac2unix' would be able to fix it.
Simply run the programs on each file; they will fix it in place.
I would also recommend ensuring each file does end with a newline.
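Before converting anything, it can help to see which files are affected. A rough sketch, using only standard tools, that reports files containing carriage returns and files that lack a final newline; the helper names `has_cr` and `lacks_final_newline` and the sample files are hypothetical:

```shell
# Sample files (made up): one with CR-LF endings, one without a final newline.
tmp=$(mktemp -d)
printf '>id1\r\nACGT\r\n' > "$tmp/dos.fa"
printf '>id2\nACGT'       > "$tmp/unterminated.fa"

# A literal carriage return to search for.
cr=$(printf '\r')

# True if the file contains at least one carriage return.
has_cr() { grep -q "$cr" "$1"; }

# True if the last byte of the file is not a newline.
lacks_final_newline() { [ -n "$(tail -c 1 "$1")" ]; }

for f in "$tmp"/*.fa; do
  if has_cr "$f"; then
    echo "$f: contains CR characters"
  fi
  if lacks_final_newline "$f"; then
    echo "$f: no trailing newline"
  fi
done
```

Files flagged by the first check are candidates for dos2unix or mac2unix; files flagged by the second need a newline appended.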


Second,
The FASTA id (the long text before your nucleotide sequence) contains spaces, and this will make downstream processing a bit of a pain.
I would recommend trimming the FASTA identifier and keeping only the first part (since it contains your IDs, you should have no problem
recovering the organism name later).

Example:

  $ cat 1.fa
  >gi|452742846|ref|NZ_CAFD010000001.1|  Salmonella enterica subsp., whole genome shotgun sequence
  TTTCAGCATATATATAGGCCATCATACATAGCCATATAT

  $ sed '/^>/s/ .*$//' 1.fa > 2.fa

  $ cat 2.fa
  >gi|452742846|ref|NZ_CAFD010000001.1|
  TTTCAGCATATATATAGGCCATCATACATAGCCATATAT

Or do it in place for all your FA files (be sure to have a backup, though):

   for i in *.fa ; do sed -i '/^>/s/ .*$//' "$i" ; done


Third,
To combine and convert the files into a table (i.e. 1st column=ID, 2nd column=sequence),
assuming all your sequences are short and contained on one line, the following would work:

  $ cat 2.fa
  >gi|452742846|ref|NZ_CAFD010000001.1|
  TTTCAGCATATATATAGGCCATCATACATAGCCATATAT

  $ cat 3.fa
  >gi|452742846|ref|NZ_CAFD010000002.1|
  CATAGCCATATATACTAGCTGACTGACGTCGCAGCTGGTCAGACTGACGTACGTCGACTGACGTC

  $ cat *.fa | paste - - | sed 's/^>//' > final.txt

  $ cat final.txt
  gi|452742846|ref|NZ_CAFD010000001.1|	TTTCAGCATATATATAGGCCATCATACATAGCCATATAT
  gi|452742846|ref|NZ_CAFD010000002.1|	CATAGCCATATATACTAGCTGACTGACGTCGCAGCTGGTCAGACTGACGTACGTCGACTGACGTC

The 'final.txt' file will be an easy-to-work-with tabular file.
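As an illustration of why the tabular form is convenient, here is a small sketch that rebuilds such a table from two sample records and then pulls out the ID column and the sequence lengths. The scratch directory is an assumption; the records are the ones shown above:

```shell
tmp=$(mktemp -d)
seq1=TTTCAGCATATATATAGGCCATCATACATAGCCATATAT
seq2=CATAGCCATATATACTAGCTGACTGACGTCGCAGCTGGTCAGACTGACGTACGTCGACTGACGTC

# Two single-line FASTA files, as in the example above.
printf '%s\n' '>gi|452742846|ref|NZ_CAFD010000001.1|' "$seq1" > "$tmp/2.fa"
printf '%s\n' '>gi|452742846|ref|NZ_CAFD010000002.1|' "$seq2" > "$tmp/3.fa"

# Same pipeline as above: pair up header and sequence lines, drop the '>'.
cat "$tmp"/*.fa | paste - - | sed 's/^>//' > "$tmp/final.txt"

# Column 1: the IDs only.
cut -f 1 "$tmp/final.txt"

# ID plus sequence length, tab-separated.
awk -F '\t' '{ print $1 "\t" length($2) }' "$tmp/final.txt"
```

Because each record sits on one tab-separated line, tools such as cut, sort, join and awk can work on it directly.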


Fourth,
If your FASTA files contain multi-lined long sequences, like so:

   >gi|452742846|ref|NZ_CAFD010000002.1|
   CATAGCCATATATACTAGCTGACTGACGTCGCAGCTGGTCAGACTGACGTAC
   GTCGACTGACGTCTGTACACCACACGTTGTGACGAGCATCGACTAGCATCAG
   TTGAGCGACATCATCAGCGACGAGATCACGAGCACTAGCACTACGACTACGA

You might consider using a specialized tool to convert them to a table, such as:
 http://manpages.ubuntu.com/manpages/trusty/man1/fasta_formatter.1.html (*)
 or http://kirill-kryukov.com/study/tools/fasta-formatter/ .
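If installing a dedicated tool is not an option, a plain awk script can do a similar job for simple cases. The following is only a sketch, not a replacement for the tools above: it assumes every record starts with a '>' header line, and the second record ('seq2') and the file names are made up for illustration:

```shell
tmp=$(mktemp -d)
cat > "$tmp/multi.fa" <<'EOF'
>gi|452742846|ref|NZ_CAFD010000002.1|
CATAGCCATATATACTAGCTGACTGACGTCGCAGCTGGTCAGACTGACGTAC
GTCGACTGACGTCTGTACACCACACGTTGTGACGAGCATCGACTAGCATCAG
>seq2
ACGT
TTGA
EOF

awk '
  /^>/ {                                   # header line starts a new record
    if (id != "") print id "\t" seq        # flush the previous record
    id = substr($0, 2)                     # ID without the leading ">"
    seq = ""
    next
  }
  { seq = seq $0 }                         # append one sequence line
  END { if (id != "") print id "\t" seq }  # flush the last record
' "$tmp/multi.fa" > "$tmp/table.txt"

cat "$tmp/table.txt"
```

The result has one line per record, with the ID in the first column and the joined sequence in the second.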

Hope this helps,
 - assaf



(* shameless plug: I wrote fasta_formatter long ago)





Information forwarded to bug-coreutils <at> gnu.org:
bug#22001; Package coreutils. (Tue, 24 Nov 2015 00:06:01 GMT) Full text and rfc822 format available.

Message #23 received at 22001 <at> debbugs.gnu.org (full text, mbox):

From: "Macdonald, Kim - BCCDC" <kim.macdonald <at> bccdc.ca>
To: 'Assaf Gordon' <assafgordon <at> gmail.com>, Bob Proulx <bob <at> proulx.com>
Cc: "22001 <at> debbugs.gnu.org" <22001 <at> debbugs.gnu.org>
Subject: RE: bug#22001: Is it possible to tab separate concatenated files?
Date: Mon, 23 Nov 2015 16:04:49 -0800
Thanks so much!!! I'll try these out now

Kim







Information forwarded to bug-coreutils <at> gnu.org:
bug#22001; Package coreutils. (Thu, 26 Nov 2015 23:54:03 GMT) Full text and rfc822 format available.

Message #26 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Linda Walsh <coreutils <at> tlinx.org>
To: Bob Proulx <bob <at> proulx.com>
Cc: 22001 <at> debbugs.gnu.org, "Macdonald, Kim - BCCDC" <kim.macdonald <at> bccdc.ca>,
 bug-coreutils <at> gnu.org
Subject: Re: bug#22001: Is it possible to tab separate concatenated files?
Date: Thu, 26 Nov 2015 15:52:46 -0800



Bob Proulx wrote:
>
> That example shows a completely different problem.  It shows that your
> input plain text files have no terminating newline, making them
> officially[/sic/] not plain text files but binary files.  

> Because every plain
> text line in a file must be terminated with a newline.
----
   That's only a recent POSIX definition.  It's not related to
real life.  When I looked for a text file definition on google, nothing
was mentioned about needing a newline on the last line -- except on
1 site -- and that site was clearly not talking about 'text' files, but
Unix-text-record files w/each record terminated by a NL char.

   On a mac, txt files have records separated by 'CR', and on DOS/Win,
txt files have txt records separated by CRLF.  Wikipedia quotes the
Unicode definition of txt files -- which doesn't require the POSIX
txt-record definition.  Also POSIX limits txt format to 'LINE_MAX' bytes --
notice it says 'bytes' and not characters.  Yet a unicode line of 256
characters can easily exceed 1024 bytes.  Yet never in the history of
the English language have lines been restricted to some number of bytes or
characters.  But one could note that the posix definition ONLY refers
to files -- not streams of TEXT (whatever the character set). 

   Specifically, note that 'TEXT COLUMNS' describes text
columns measured in column widths -- yet that conflicts with the
definition of Text File, in that text files use 'bytes' for a maximum
line length, while text columns use 'characters' (which can be
1-4 bytes in unicode, UTF-8 or UTF-16 encoded).

   Of specific note -- "text" composed of characters, MUST
support 'NUL' (as well as 'the audio bell' (control-g), the
backspace (control-h), vertical tabs (U+000B), form-feed (U+000C)).

   No standard definition outside POSIX includes any of those
characters -- because text characters are supposed to be readable
and visible.  But POSIX compatibility claims that the Portable
Character Set
(http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html#tag_06_01)
must include those characters.

   The 'text-files-must-have-NL' group ignores the POSIX 2008 definition of
a portable character set -- but globs onto the implied definition
of a text line as part of a 'text file'.

   But as already noted, POSIX has conflicting definitions about what text
is (Unicode measured in chars/columns, or ASCII measured in bytes).  But
POSIX 2008 (same url as above) clearly states:
A null character, NUL, which has all bits set to zero, shall be in the
set of [supported] characters.

   In all plain-text definitions, it is mentioned that 'text' is a
set of displayable characters that can be broken into lines with the
text-line separator definition.  The last line of the file Needs No
separation character at the end of the line as it doesn't need to be
separated from anything.

   The GNU standard should not limit itself to an *arcane* (and not well
known outside of POSIX-fans) definition of text, as it makes text files
created before 2008, potentially incompatible.

   POSIX was supposed to be about portability... it certainly doesn't
follow the internet-design-mime of "Accept input liberally, and generate
output conservatively."

> If they are
> not then it isn't a text line.  Must be binary.
>   
---
   Whereas I maintain that Newlines are required to break plain-text
into records -- but not at the end-of-file, since there is no record
following.


> Why isn't there a newline at the end of the file?  Fix that and all of
> your problems and many others go away.
>   
---
   Didn't used to be a requirement -- it was added because of a broken
interpretation of the posix standard.  Please remember that a posixified
definition of 'X' (for any X), may not be the same as a real-life 'X'.

   In this case,  we have a file containing *text* by the POSIX
def, which you claim doesn't meet the POSIX definition of "text file".
    It's similar to Orwellian-speak -- redefining common terms to mean
something else, so people don't notice the requirement change, then later
telling others to clean-up their old input code/data that doesn't
meet the newly created definition.  Text files have been around a lot
longer than 8 years.  Posix disqualifies most text files, for example,
those created on the most widely used laptop/desktop/commercial computer OS
in the world (Windows).

   I think what may be true is that 'POSIX text files' describe a data
format that may not be how it is stored on disk.  I find it very
interesting in how 'NUL' is defined to be part of any POSIX text character
set definition where such apps claim to support or process 'text'.

   It's sad to see the GNU utils becoming less flexible and more
restricted over time -- much like the trend in computers to steer
the public away from general purpose processing (and computers that
can do such), to a tightly controlled, walled garden where consumers
are only allowed to do what the manufacturer tells them to do.

   I suppose it's like the trend in US government that became federal law
during the Nixon years (use of a product inconsistent with its
labeling is a violation of federal law).  Whereas before, any usage that
wasn't prohibited by local law was allowed.  It is moving away
from a free society with specific restrictions to a controlled society
with specific, limited freedoms.















Information forwarded to bug-coreutils <at> gnu.org:
bug#22001; Package coreutils. (Thu, 26 Nov 2015 23:54:04 GMT) Full text and rfc822 format available.

Information forwarded to bug-coreutils <at> gnu.org:
bug#22001; Package coreutils. (Fri, 27 Nov 2015 03:29:02 GMT) Full text and rfc822 format available.

Message #32 received at 22001 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Linda Walsh <coreutils <at> tlinx.org>, Bob Proulx <bob <at> proulx.com>
Cc: 22001 <at> debbugs.gnu.org, kim.macdonald <at> bccdc.ca
Subject: Re: bug#22001: Is it possible to tab separate concatenated files?
Date: Thu, 26 Nov 2015 20:28:13 -0700
[Message part 1 (text/plain, inline)]
On 11/26/2015 04:52 PM, Linda Walsh wrote:

>> Because every plain
>> text line in a file must be terminated with a newline.
> ----
>    That's only a recent POSIX definition.  It's not related to
> real life.  When I looked for a text file definition on google, nothing
> was mentioned about needing a newline on the last line -- except on
> 1 site -- and that site was clearly not talking about 'text' files, but
> Unix-text-record files w/each record terminated by a NL char.
> 

Quit spreading FUD about POSIX.  That definition of text file is NOT a
recent invention; even back in POSIX 2001 the definition read:

3.392 Text File

A file that contains characters organized into one or more lines. The
lines do not contain NUL characters and none can exceed {LINE_MAX} bytes
in length, including the <newline>. Although IEEE Std 1003.1-2001 does
not distinguish between text files and binary files (see the ISO C
standard), many utilities only produce predictable or meaningful output
when operating on text files. The standard utilities that have such
restrictions always specify "text files" in their STDIN or INPUT FILES
sections.
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html

That was POSIX Issue 6; the more recent POSIX Issue 7 corrected the
definition to also allow a completely empty file to be considered as a
text file.  But the point is that POSIX has always required a text file
to end in a newline.

>    On a mac, txt files have records separated by 'CR', and on DOS/Win,
> txt files have txt records separated by CRLF.

And those systems aren't POSIX.  So they aren't relevant to a discussion
about POSIX.


>> Why isn't there a newline at the end of the file?  Fix that and all of
>> your problems and many others go away.
>>   
> ---
>    Didn't used to be a requirement -- it was added because of a broken
> interpretation of the posix standard.  Please remember that a a posixified
> definition of 'X' (for any X), may not be the same as a real-live 'X'.

No, it has ALWAYS been a problem.  Even 40 years ago, before POSIX was
invented, the only PORTABLE way to use programs like sed was to use it
on text files - namely, files where no line exceeded LINE_MAX bytes,
where no lines contained NUL bytes, and where ALL lines ended in
newline.  Because there were vendor implementations of sed (not GNU
coreutils, mind you, but other vendors) that really were hardcoded to
some rather small limits, and understandably so in a day when computers
did not have as much memory as they do today.  POSIX just standardized
existing practice on what formed a text file, when it came to existing
Unix systems at that time.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#22001; Package coreutils. (Fri, 27 Nov 2015 08:23:02 GMT) Full text and rfc822 format available.

Message #35 received at 22001 <at> debbugs.gnu.org (full text, mbox):

From: Erik Auerswald <auerswal <at> unix-ag.uni-kl.de>
To: Eric Blake <eblake <at> redhat.com>
Cc: 22001 <at> debbugs.gnu.org, kim.macdonald <at> bccdc.ca,
 Linda Walsh <coreutils <at> tlinx.org>, Bob Proulx <bob <at> proulx.com>
Subject: Re: bug#22001: Is it possible to tab separate concatenated files?
Date: Fri, 27 Nov 2015 09:22:05 +0100
Hi,

On Thu, Nov 26, 2015 at 08:28:13PM -0700, Eric Blake wrote:
> On 11/26/2015 04:52 PM, Linda Walsh wrote:
> 
> >> Because every plain
> >> text line in a file must be terminated with a newline.
> > ----
> >    That's only a recent POSIX definition.  It's not related to
> > real life.  When I looked for a text file definition on google, nothing
> > was mentioned about needing a newline on the last line -- except on
> > 1 site -- and that site was clearly not talking about 'text' files, but
> > Unix-text-record files w/each record terminated by a NL char.
> > 
> 
> Quit spreading FUD about POSIX.  That definition of text file is NOT a
> recent invention; even back in POSIX 2001 the definition read:
> 
> 3.392 Text File
> 
> A file that contains characters organized into one or more lines. The
> lines do not contain NUL characters and none can exceed {LINE_MAX} bytes
> in length, including the <newline>. Although IEEE Std 1003.1-2001 does
> not distinguish between text files and binary files (see the ISO C
> standard), many utilities only produce predictable or meaningful output
> when operating on text files. The standard utilities that have such
> restrictions always specify "text files" in their STDIN or INPUT FILES
> sections.
> http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html

At least the definition of a "line" is needed as well to understand the
above (from the same URL):

 3.205 Line

 A sequence of zero or more non- <newline>s plus a terminating <newline>.
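The practical effect of this definition can be seen with standard tools. A small sketch with made-up file names: wc -l counts newline characters, and a shell read loop does not run its body for trailing text that lacks a terminating newline.

```shell
tmp=$(mktemp -d)
printf 'one\ntwo\n' > "$tmp/terminated.txt"    # two complete lines
printf 'one\ntwo'   > "$tmp/unterminated.txt"  # "two" has no <newline>

wc -l < "$tmp/terminated.txt"     # counts 2 newline characters
wc -l < "$tmp/unterminated.txt"   # counts 1: "two" is not a complete line

# read returns non-zero when it hits end-of-file before a newline, so
# the loop body runs only once for the unterminated file.
n=0
while IFS= read -r line; do
  n=$((n + 1))
done < "$tmp/unterminated.txt"
echo "lines seen by the read loop: $n"
```

Scripts that must tolerate unterminated input commonly use `while IFS= read -r line || [ -n "$line" ]` instead.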

[...]
> 
> No, it has ALWAYS been a problem.  Even 40 years ago, before POSIX was
> invented, the only PORTABLE way to use programs like sed was to use it
> on text files [...]

The sed of Solaris 10 ignores trailing text after the last line, that
is after the last newline. I am quite sure this behavior has been in
older Solaris and SunOS versions as well.

Best regards,
Erik
-- 
http://www.unix-ag.uni-kl.de/~auerswal/




Added tag(s) notabug. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Wed, 24 Oct 2018 21:22:01 GMT) Full text and rfc822 format available.

bug closed, send any further explanations to 22001 <at> debbugs.gnu.org and "Macdonald, Kim - BCCDC" <kim.macdonald <at> bccdc.ca> Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Wed, 24 Oct 2018 21:22:01 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 22 Nov 2018 12:24:03 GMT) Full text and rfc822 format available.

This bug report was last modified 6 years and 213 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.