GNU bug report logs - #22001
Is it possible to tab separate concatenated files?


Package: coreutils;

Reported by: "Macdonald, Kim - BCCDC" <kim.macdonald <at> bccdc.ca>

Date: Mon, 23 Nov 2015 21:03:02 UTC

Severity: normal

Tags: notabug

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.


From: "Macdonald, Kim - BCCDC" <kim.macdonald <at> bccdc.ca>
To: 'Assaf Gordon' <assafgordon <at> gmail.com>, Bob Proulx <bob <at> proulx.com>
Cc: "22001 <at> debbugs.gnu.org" <22001 <at> debbugs.gnu.org>
Subject: bug#22001: Is it possible to tab separate concatenated files?
Date: Mon, 23 Nov 2015 16:04:49 -0800

Thanks so much!!! I'll try these out now.

Kim


-----Original Message-----
From: Assaf Gordon [mailto:assafgordon <at> gmail.com] 
Sent: November 23, 2015 3:48 PM
To: Bob Proulx; Macdonald, Kim - BCCDC
Cc: 22001 <at> debbugs.gnu.org
Subject: Re: bug#22001: Is it possible to tab separate concatenated files?

Hello Kim,

On 11/23/2015 06:09 PM, Bob Proulx wrote:
> Macdonald, Kim - BCCDC wrote:
>> For Example:
>> Concatenate the files like so:
>>> gi|452742846|ref|NZ_CAFD010000001.1| Salmonella enterica subsp., 
>>> gi|452742846|ref|whole genome shotgun 
>>> gi|452742846|ref|sequenceTTTCAGCATATATATAGGCCATCATACATAGCCATATAT
>>> gi|452742846|ref|NZ_CAFD010000002.1| Salmonella enterica subsp., 
>>> gi|452742846|ref|whole genome shotgun 
>>> gi|452742846|ref|sequenceCATAGCCATATATACTAGCTGACTGACGTCGCAGCTGGTCAGA
>>> gi|452742846|ref|CTGACGTACGTCGACTGACGTC
>>> gi|452742846|ref|NZ_CAFD010000003.1| Salmonella enterica subsp., 
>>> gi|452742846|ref|whole genome shotgun 
>>> gi|452742846|ref|sequenceTATATAGATACATATATCGCGATATCAGACTGCATAGCGTCAG
>>
> That example shows a completely different problem.  It shows that your 
> input plain text files have no terminating newline, making them 
> officially not plain text files but binary files.

Based on the content of your files, I'm guessing that you are working with mangled FASTA files.
In that case, it is possible that fixing the original files might be more efficient than trying to amend them later on.

The original FASTA files likely looked like so:

     >gi|452742846|ref|NZ_CAFD010000001.1| Salmonella enterica subsp., whole genome shotgun sequence
     TTTCAGCATATATATAGGCCATCATACATAGCCATATAT

And I'm also guessing that with some script you've removed the ">" prefix and joined the two lines into one.

First,
I suggest ensuring the original files have Unix-style newlines (LF) and not Windows-style (CR-LF) or Mac-style (CR).
The programs 'dos2unix' and 'mac2unix' can fix this.
Simply run the appropriate program on each file; it will fix the file in place.
I would also recommend ensuring each file ends with a newline.
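
As a rough sketch of the last point, the shell function below appends a newline only to files that lack one. The name 'ensure_final_newline' is made up for illustration, and it assumes the files are non-binary text that 'tail -c' can read:

```shell
# Hypothetical helper: add a final newline to each file that lacks one.
ensure_final_newline() {
    for f in "$@"; do
        # $(...) strips a trailing newline, so the result is non-empty
        # only when the last byte of the file is something else.
        if [ -n "$(tail -c 1 "$f")" ]; then
            echo "adding missing newline: $f" >&2
            echo >> "$f"
        fi
    done
}

# Usage: ensure_final_newline *.fa
```

Files that already end with a newline (and empty files) are left untouched, so it should be safe to run more than once.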


Second,
The FASTA id (the long text before your nucleotide sequence) contains spaces, and this will make downstream processing a bit of a pain.
I would recommend trimming the FASTA identifier and keeping only the first part (since it contains your IDs, you should have no problem recovering the organism name later).

Example:

   $ cat 1.fa
   >gi|452742846|ref|NZ_CAFD010000001.1|  Salmonella enterica subsp., whole genome shotgun sequence
   TTTCAGCATATATATAGGCCATCATACATAGCCATATAT

   $ sed '/^>/s/ .*$//' 1.fa > 2.fa

   $ cat 2.fa
   >gi|452742846|ref|NZ_CAFD010000001.1|
   TTTCAGCATATATATAGGCCATCATACATAGCCATATAT

Or do it in place for all your FA files (be sure to have a backup, though):

    for i in *.fa ; do sed -i '/^>/s/ .*$//' "$i" ; done


Third,
To combine and convert the files into a table (i.e. 1st column=ID, 2nd column=sequence), the following would work, assuming all your sequences are short and each one is contained on a single line:

   $ cat 2.fa
   >gi|452742846|ref|NZ_CAFD010000001.1|
   TTTCAGCATATATATAGGCCATCATACATAGCCATATAT

   $ cat 3.fa
   >gi|452742846|ref|NZ_CAFD010000002.1|
   CATAGCCATATATACTAGCTGACTGACGTCGCAGCTGGTCAGACTGACGTACGTCGACTGACGTC

   $ cat *.fa | paste - - | sed 's/^>//' > final.txt

   $ cat final.txt
   gi|452742846|ref|NZ_CAFD010000001.1|	TTTCAGCATATATATAGGCCATCATACATAGCCATATAT
   gi|452742846|ref|NZ_CAFD010000002.1|	CATAGCCATATATACTAGCTGACTGACGTCGCAGCTGGTCAGACTGACGTACGTCGACTGACGTC

The resulting 'final.txt' will be an easy-to-work-with tabular file.
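
Since 'paste - -' silently pairs the wrong lines if any file has a wrapped sequence or a missing newline, a quick sanity check of the table may be worthwhile. This is only a sketch: the name 'check_table' is made up, and it assumes sequences contain nothing but the letters A, C, G, T and N.

```shell
# Hypothetical check: succeed only if every row is "ID<TAB>sequence".
# Assumes sequences use only the letters A, C, G, T and N.
check_table() {
    awk -F'\t' '
        NF != 2 || $2 !~ /^[ACGTNacgtn]+$/ {
            printf "%s:%d: unexpected row\n", FILENAME, FNR
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Usage: check_table final.txt && echo "table looks fine"
```

If your sequences use other IUPAC codes, the character class would need to be widened accordingly.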


Fourth,
If your FASTA files contain long multi-line sequences, like so:

    >gi|452742846|ref|NZ_CAFD010000002.1|
    CATAGCCATATATACTAGCTGACTGACGTCGCAGCTGGTCAGACTGACGTAC
    GTCGACTGACGTCTGTACACCACACGTTGTGACGAGCATCGACTAGCATCAG
    TTGAGCGACATCATCAGCGACGAGATCACGAGCACTAGCACTACGACTACGA

You might consider using a specialized tool to convert them to a table, such as:
  http://manpages.ubuntu.com/manpages/trusty/man1/fasta_formatter.1.html (*)
  or http://kirill-kryukov.com/study/tools/fasta-formatter/ .
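
If installing one of those tools is not an option, a plain awk script can do a similar job. The following is a minimal sketch, not how either tool works internally; the name 'fasta_to_table' is made up, and it assumes the files already have LF line endings and trimmed identifiers as described above.

```shell
# Hypothetical helper: print "ID<TAB>sequence" for each FASTA record,
# joining sequences that are wrapped across several lines.
fasta_to_table() {
    awk '
        /^>/ {
            if (id != "") print id "\t" seq   # flush the previous record
            id = substr($0, 2)                # drop the leading ">"
            seq = ""
            next
        }
        { seq = seq $0 }                      # append a sequence line
        END { if (id != "") print id "\t" seq }
    ' "$@"
}

# Usage: fasta_to_table *.fa > final.txt
```

It also handles single-line sequences, so it could be used in place of the 'paste' pipeline from the third step.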

Hope this helps,
  - assaf



(* shameless plug: I wrote fasta_formatter long ago)





This bug report was last modified 6 years and 213 days ago.


