Monday, July 18, 2016

Fastest way to count reads in a fastq file: use parallel

On a 16-core machine, counting the reads in the unzipped fastq files took very little time: about 20 seconds of wall-clock time for all four files.


$ time parallel "echo {} && cat {} | wc -l | awk '{d=\$1; print d/4;}'" ::: 500bp_insert*.fastq

500bp_insert_lane1_2_read2PE_single_appended.Genome.completely.cleaned.fastq_ORPHAN.fastq
408600
500bp_insert_lane1_2_read1PE_single_appended.Genome.completely.cleaned.fastq_ORPHAN.fastq
2668085
500bp_insert_lane1_2_read2PE_single_appended.Genome.completely.cleaned.fastq_matched.fastq
148713855
500bp_insert_lane1_2_read1PE_single_appended.Genome.completely.cleaned.fastq_matched.fastq
148713855


real    0m19.850s
user    0m13.144s
sys     0m56.604s
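
The count above is simply the number of lines divided by four. A minimal sketch of the same idea as a reusable function is shown here; the name count_reads is hypothetical (not part of the original commands), and it assumes every FASTQ record occupies exactly four lines, with no wrapped sequence or quality lines. Reading the file through a redirect also avoids the extra cat process.

```shell
# Hypothetical helper (name is an assumption, not from the original post):
# count reads in an uncompressed FASTQ file.
# Assumes each record is exactly four lines (header, sequence, '+', quality).
count_reads() {
    # Redirect the file into wc so only the line count is printed
    lines=$(wc -l < "$1")
    # Integer division: four lines per read
    echo $(( lines / 4 ))
}
```

In bash, the function could probably be combined with GNU parallel by running `export -f count_reads` and then `parallel --tag count_reads ::: 500bp_insert*.fastq`, where `--tag` prefixes each count with its file name.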

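The gzipped files, around 23 GB in size, took noticeably longer: roughly 2 min 40 s to 2 min 50 s of wall-clock time for each pair of files (a little over 5 min of CPU time per run).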


$ time parallel "echo {} && gunzip -c {} | wc -l | awk '{d=\$1; print d/4;}'" ::: 500bp_insert_lane*_read1.fastq.gz


500bp_insert_lane1_read1.fastq.gz
75640549

500bp_insert_lane2_read1.fastq.gz
76849862


real    2m40.358s
user    5m6.051s
sys     0m24.710s


$ time parallel "echo {} && gunzip -c {} | wc -l | awk '{d=\$1; print d/4;}'" ::: 500bp_insert_lane*_read2.fastq.gz

500bp_insert_lane2_read2.fastq.gz
76849862
500bp_insert_lane1_read2.fastq.gz
75640549


real    2m49.668s
user    5m17.894s
sys     0m28.807s
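
For the compressed files, the line counting and the division can be folded into a single awk step. The sketch below is one possible variant, not the commands used above; the function name count_reads_gz is hypothetical, and it again assumes four-line FASTQ records.

```shell
# Hypothetical helper (name is an assumption, not from the original post):
# count reads in a gzip-compressed FASTQ file.
# Assumes each record is exactly four lines.
count_reads_gz() {
    # Decompress to stdout; awk's NR holds the total line count at END
    gunzip -c "$1" | awk 'END { printf "%d\n", NR / 4 }'
}
```

Most of the time here is likely spent in decompression, which runs as a single gunzip process per file, so running files in parallel helps only up to the number of files.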
