GNU bug report logs - #36130
split bug

Package: coreutils

Reported by: Heather Wick <heather.c.wick <at> gmail.com>

Date: Fri, 7 Jun 2019 18:47:01 UTC

Severity: normal

Tags: notabug

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.


Message #11 received at 36130 <at> debbugs.gnu.org (full text, mbox):

From: Heather Wick <heather.c.wick <at> gmail.com>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: 36130 <at> debbugs.gnu.org
Subject: Re: bug#36130: split bug
Date: Fri, 7 Jun 2019 21:48:44 -0400
Hi,
Yes, sorry, I should have specified that I already checked that the
original fastq files are indeed paired and sorted with the same number of
lines and same starting/ending IDs, narrowing down the issue to a problem
with split.
~ Heather


(base) [hwick <at> zappalogin ~]$ zcat  MH2_R2.fastq.gz | wc -l

3778103832

(base) [hwick <at> zappalogin ~]$ zcat  MH2_R1.fastq.gz | wc -l

3778103832


(base) [hwick <at> zappalogin test_2019]$ zcat MH2_R1.fastq.gz | head -n8 | grep ^@

@A00197:48:HF2GWDMXX:1:1101:1741:1000 1:N:0:GATCAG+TCTTTCCC

@A00197:48:HF2GWDMXX:1:1101:2754:1000 1:N:0:GATCAG+TCTTTCCC

(base) [hwick <at> zappalogin test_2019]$ zcat MH2_R2.fastq.gz | head -n8 | grep ^@

@A00197:48:HF2GWDMXX:1:1101:1741:1000 2:N:0:GATCAG+TCTTTCCC

@A00197:48:HF2GWDMXX:1:1101:2754:1000 2:N:0:GATCAG+TCTTTCCC


(base) [hwick <at> zappalogin test_2019]$ zcat MH2_R1.fastq.gz | tail -n8 | grep ^@

@E00489:288:HMFWCCCXY:2:2224:29305:73106 1:N:0:GATCAG

@E00489:288:HMFWCCCXY:2:2224:29325:73106 1:N:0:GATCAG

(base) [hwick <at> zappalogin test_2019]$ zcat MH2_R2.fastq.gz | tail -n8 | grep ^@

@E00489:288:HMFWCCCXY:2:2224:29305:73106 2:N:0:GATCAG

@E00489:288:HMFWCCCXY:2:2224:29325:73106 2:N:0:GATCAG
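
For reference, the number of chunks that `split -l` should produce follows from the line count by ceiling division. The sketch below uses the line count reported above for the MH2 files; the original split commands name MH1 files, so the result is only indicative. `expected_chunks` is a hypothetical helper name, not a coreutils command.

```shell
#!/bin/sh
# Hypothetical helper: how many output files `split -l PER_CHUNK` would
# create for an input of LINES lines (ceiling division).
expected_chunks() {
    lines=$1
    per_chunk=$2
    echo $(( (lines + per_chunk - 1) / per_chunk ))
}

# Line count taken from the `wc -l` output above, chunk size from the
# split commands in the original report.
expected_chunks 3778103832 40000000   # prints 95
```

Two inputs with the same line count give the same result, so under this arithmetic a correctly paired R1/R2 set would be expected to split into the same number of chunks.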




On Fri, Jun 7, 2019 at 9:29 PM Assaf Gordon <assafgordon <at> gmail.com> wrote:

> Hello,
>
> On Fri, Jun 07, 2019 at 02:23:15PM -0400, Heather Wick wrote:
> > I am using split to split up some large, paired fastq files [...]:
> >
> >   zcat MH1_R1.fastq.gz | split - -l 40000000 DHT_R1_
> >   zcat MH1_R2.fastq.gz | split - -l 40000000 DHT_R2_
> >
> > This creates 96 chunks for the R1 and 95 chunks for R2, even though the
> > original fastq files have the same number of reads.
> >
> > Do you have any suggestions for how to proceed? Perhaps zcatting and piping
> > the files is not the best way to call split?
>
> To help diagnose the issue better, please run the following commands
> and tell us the results:
>
> 1. number of lines in each file:
>
>    zcat MH1_R1.fastq.gz | wc -l
>    zcat MH1_R2.fastq.gz | wc -l
>
> 2. The first two sequence IDs:
>
>    zcat MH1_R1.fastq.gz | head -n8 | grep ^@
>    zcat MH1_R2.fastq.gz | head -n8 | grep ^@
>
> 3. Last two sequence IDs:
>
>    zcat MH1_R1.fastq.gz | tail -n8 | grep ^@
>    zcat MH1_R2.fastq.gz | tail -n8 | grep ^@
>
> These will just verify the FASTQ files are indeed paired with no
> surprises. The files should have the same number of lines,
> and matching sequence IDs in the first and last lines.
>
> regards,
>  - assaf
>
>
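
The three checks requested above can be bundled into one function. This is a minimal sketch, not something posted in the thread: `check_pair` is a hypothetical name, `gzip -dc` stands in for `zcat`, and the files are assumed to use strict 4-line FASTQ records. It picks header lines by position rather than with `grep ^@`, because a quality line may also begin with `@`.

```shell
#!/bin/sh
# Hypothetical helper: check that two gzipped FASTQ files look paired.
# Assumes 4-line records and IDs that end at the first space.
check_pair() {
    r1=$1
    r2=$2

    # 1. Both files must have the same number of lines.
    n1=$(gzip -dc "$r1" | wc -l)
    n2=$(gzip -dc "$r2" | wc -l)
    if [ "$n1" -ne "$n2" ]; then
        echo "line counts differ: $n1 vs $n2"
        return 1
    fi

    # 2 and 3. The first two and last two sequence IDs must match.
    # Within an 8-line window, header lines are lines 1 and 5.
    for cmd in head tail; do
        ids1=$(gzip -dc "$r1" | $cmd -n8 | awk 'NR % 4 == 1 { print $1 }')
        ids2=$(gzip -dc "$r2" | $cmd -n8 | awk 'NR % 4 == 1 { print $1 }')
        if [ "$ids1" != "$ids2" ]; then
            echo "sequence IDs differ in $cmd of files"
            return 1
        fi
    done

    echo "OK: line counts and first/last sequence IDs match"
}

# Usage (file names as in the quoted commands):
#   check_pair MH1_R1.fastq.gz MH1_R2.fastq.gz
```

A passing result only shows that the ends of the files agree; it does not rule out a mismatch somewhere in the middle.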

-- 
Heather Wick
PhD Candidate, Human Genetics
Labs of Sarah Wheelan and Vasan Yegnasubramanian
Institute of Genetic Medicine
Johns Hopkins University School of Medicine
hwick1 <at> jhmi.edu

This bug report was last modified 5 years and 332 days ago.

