Hi,
Yes, sorry, I should have mentioned that I had already checked the original fastq files: they are paired and sorted, with the same number of lines and the same starting and ending IDs. That narrows the issue down to split.
~ Heather


(base) [hwick@zappalogin ~]$ zcat MH2_R2.fastq.gz | wc -l
3778103832
(base) [hwick@zappalogin ~]$ zcat MH2_R1.fastq.gz | wc -l
3778103832

(base) [hwick@zappalogin test_2019]$ zcat MH2_R1.fastq.gz | head -n8 | grep ^@
@A00197:48:HF2GWDMXX:1:1101:1741:1000 1:N:0:GATCAG+TCTTTCCC
@A00197:48:HF2GWDMXX:1:1101:2754:1000 1:N:0:GATCAG+TCTTTCCC
(base) [hwick@zappalogin test_2019]$ zcat MH2_R2.fastq.gz | head -n8 | grep ^@
@A00197:48:HF2GWDMXX:1:1101:1741:1000 2:N:0:GATCAG+TCTTTCCC
@A00197:48:HF2GWDMXX:1:1101:2754:1000 2:N:0:GATCAG+TCTTTCCC

(base) [hwick@zappalogin test_2019]$ zcat MH2_R1.fastq.gz | tail -n8 | grep ^@
@E00489:288:HMFWCCCXY:2:2224:29305:73106 1:N:0:GATCAG
@E00489:288:HMFWCCCXY:2:2224:29325:73106 1:N:0:GATCAG
(base) [hwick@zappalogin test_2019]$ zcat MH2_R2.fastq.gz | tail -n8 | grep ^@
@E00489:288:HMFWCCCXY:2:2224:29305:73106 2:N:0:GATCAG
@E00489:288:HMFWCCCXY:2:2224:29325:73106 2:N:0:GATCAG
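
For reference, the number of chunks split should produce can be worked out from the line count reported by wc above. This is only a rough sketch: `expected_chunks` is a made-up helper name, and it assumes the same `-l 40000000` chunk size as in my split commands. Since 40000000 is a multiple of 4, no chunk should cut a 4-line fastq record in half.

```shell
# Expected number of output files when a file of $1 lines is split
# into pieces of $2 lines each (integer ceiling division).
expected_chunks() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Line count taken from the wc -l output for the MH2 files.
expected_chunks 3778103832 40000000
```

With these numbers both R1 and R2 should give the same chunk count, so any difference between the two would have to come from somewhere other than the line counts.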




On Fri, Jun 7, 2019 at 9:29 PM Assaf Gordon <assafgordon@gmail.com> wrote:
Hello,

On Fri, Jun 07, 2019 at 02:23:15PM -0400, Heather Wick wrote:
> I am using split to split up some large, paired fastq files [...]:
>
>   zcat MH1_R1.fastq.gz | split - -l 40000000 DHT_R1_
>   zcat MH1_R2.fastq.gz | split - -l 40000000 DHT_R2_
>
> This creates 96 chunks for the R1 and 95 chunks for R2, even though the
> original fastq files have the same number of reads.
>
> Do you have any suggestions for how to proceed? Perhaps zcatting and piping
> the files is not the best way to call split?

To help diagnose the issue, please run the following commands
and tell us what the results are:

1. number of lines in each file:

   zcat MH1_R1.fastq.gz | wc -l
   zcat MH1_R2.fastq.gz | wc -l

2. The first two sequence IDs:

   zcat MH1_R1.fastq.gz | head -n8 | grep ^@
   zcat MH1_R2.fastq.gz | head -n8 | grep ^@

3. Last two sequence IDs:

   zcat MH1_R1.fastq.gz | tail -n8 | grep ^@
   zcat MH1_R2.fastq.gz | tail -n8 | grep ^@

These will just verify the FASTQ files are indeed paired with no
surprises. The files should have the same number of lines,
and matching sequence IDs in the first and last lines.
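
If the first and last IDs match but you want to rule out a mismatch somewhere in the middle, the same check can be extended to every record. The following is a minimal sketch, not a tested recipe for files of this size: `fastq_ids` and `first_mismatch` are hypothetical helper names, it assumes strict 4-line FASTQ records (so the header is every 4th line, which avoids matching quality lines that happen to start with @), and it uses `gzip -dc`, which is equivalent to zcat here.

```shell
# Print the read ID (first whitespace-delimited field of the header line)
# of every record in a gzipped FASTQ file.
fastq_ids() {
    gzip -dc "$1" | awk 'NR % 4 == 1 { print $1 }'
}

# Compare the IDs of two files record by record and print the number of
# the first record whose IDs differ. Prints nothing if all IDs match.
first_mismatch() {
    dir=$(mktemp -d) || return 1
    mkfifo "$dir/r2" || return 1
    # Stream the second file's IDs through a named pipe so that nothing
    # large has to be written to disk.
    fastq_ids "$2" > "$dir/r2" &
    fastq_ids "$1" | paste - "$dir/r2" |
        awk -F '\t' '$1 != $2 { print NR; exit }'
    wait
    rm -rf "$dir"
}
```

Usage would be something like `first_mismatch MH1_R1.fastq.gz MH1_R2.fastq.gz`. If one file has fewer records than the other, paste pads the shorter side with an empty field, so the first extra record is reported as a mismatch.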

regards,
 - assaf



--
Heather Wick
PhD Candidate, Human Genetics
Labs of Sarah Wheelan and Vasan Yegnasubramanian
Institute of Genetic Medicine
Johns Hopkins University School of Medicine