@solved C# with the same code is twice as fast
I am parsing a phred33 fastq file in Perl and it is taking a considerable amount of time (on the order of 15 minutes). The fastq file is about 3 GB.
Are there any reasonable ways to make this faster?
$file=shift;
open(FILE,$file);
open(FILEFA,">".$file.".fa");
open(FILEQA,">".$file.".qual");
while($line=<FILE>)
{
chomp($line);
if($line=~m/^@/)
{
$header=$line;
$header =~ s/@/>/g;
$seq=<FILE>;
chomp($seq);
$nothing=<FILE>;
$nothing="";
$fastq=<FILE>;
print FILEFA $header."\n";
print FILEFA $seq."\n";
$seq="";
print FILEQA $header."\n";
@elm=split("",$fastq);
$i=0;
while(defined($elm[$i]))
{
$Q = ord($elm[$i]) - 33;
if($Q!="-23")
{
print FILEQA $Q." ";
}
$i=$i+1;
}
print FILEQA "\n";
}
}
print $file.".fa\n";
print $file.".qual\n";
There’s next to no CPU being used here. It’s I/O bound, so it’s mostly the time to read through 3 GB. There are micro-optimisations (and other cleanups) that can be done.
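First, always start your script with use strict; use warnings;
The main code is the inner loop that converts the quality line (the split, the while loop and the -23 test).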
The purpose of if($Q!="-23") is to check if the character is a newline (ord("\n") is 10, and 10 - 33 is -23), which you wouldn’t have to do if you did chomp($fastq); first. (What’s with the quotes around -23?!)
Using a while loop just complicates things. Use a for loop when you have a known number of iterations.
It might help a bit to turn that inside out.
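Here is a rough sketch of those two suggestions, not necessarily the code originally posted. The sub names quals_for and quals_join are made up for illustration. The line is chomped first so the -23 test disappears, then the conversion is done once with a for loop and once "inside out" with map and join.

```perl
use strict;
use warnings;

# Hypothetical helper: for loop over the characters of a chomped quality line.
sub quals_for {
    my ($fastq) = @_;
    chomp($fastq);                       # drop the newline, so no -23 check is needed
    my $out = '';
    for my $ch (split //, $fastq) {
        $out .= (ord($ch) - 33) . ' ';   # phred33: score = character code - 33
    }
    return $out;
}

# Hypothetical helper: the same conversion turned inside out with map and join.
sub quals_join {
    my ($fastq) = @_;
    chomp($fastq);
    return join(' ', map { ord($_) - 33 } split //, $fastq);
}

my $spaced = quals_for("II5!\n");
my $joined = quals_join("II5!\n");
print "[$spaced]\n";   # [40 40 20 0 ]
print "[$joined]\n";   # [40 40 20 0]
```

Note that the join version does not leave a trailing space at the end of the line, unlike the loop in the question.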
On second thought, not inside out enough 🙂
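One way to go further, sketched here as a guess at what "more inside out" could look like (the sub name quals_subst is hypothetical): let a single s///eg walk the string, with the /e modifier evaluating the replacement as Perl code for every character.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a quality line with one substitution.
sub quals_subst {
    my ($fastq) = @_;
    chomp($fastq);
    # /e runs the replacement as code, /g repeats it for every character,
    # /s lets . match any character.
    (my $quals = $fastq) =~ s/(.)/(ord($1) - 33) . ' '/seg;
    return $quals;
}

my $quals = quals_subst("II5!\n");
print "[$quals]\n";   # [40 40 20 0 ]
```

This keeps the trailing space that the loop in the question produces.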
But what if we precalculated the translations? Then we wouldn’t have to call a sub (the /e code) repeatedly.
After a bit more cleanup, we get:
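What follows is a sketch of the precalculation idea, not the original listing. It assumes quality characters fall in the printable ASCII range 33..126, and the names %phred and convert_fastq are invented for this example. The table is built once, so the per-character work becomes a hash lookup interpolated into a plain s///g.

```perl
use strict;
use warnings;

# Built once: each printable ASCII character (33..126, an assumption) maps to
# its phred33 score followed by a space.
my %phred = map { (chr($_) => ($_ - 33) . ' ') } 33 .. 126;

# Hypothetical helper: read four-line FASTQ records from $in, write FASTA
# to $fa and space-separated scores to $qual.
sub convert_fastq {
    my ($in, $fa, $qual) = @_;
    while (defined(my $header = <$in>)) {
        # Only the leading @ is replaced here; the question's s/@/>/g
        # replaces every @ in the header.
        next unless $header =~ s/^\@/>/;
        my $seq       = <$in>;
        my $separator = <$in>;               # the "+" line, ignored
        my $quals     = <$in>;
        last unless defined $seq && defined $quals;   # truncated record
        chomp($quals);
        $quals =~ s/([!-~])/$phred{$1}/g;    # plain lookup, no /e needed
        print {$fa}   $header, $seq;         # both still end in a newline
        print {$qual} $header, $quals, "\n";
    }
    return;
}

# Runs only when a file name is given, as in the question.
if (@ARGV) {
    my $file = shift @ARGV;
    open(my $in,   '<', $file)        or die "Cannot read $file: $!";
    open(my $fa,   '>', "$file.fa")   or die "Cannot write $file.fa: $!";
    open(my $qual, '>', "$file.qual") or die "Cannot write $file.qual: $!";
    convert_fastq($in, $fa, $qual);
    close($_) for $in, $fa, $qual;
    print "$file.fa\n$file.qual\n";
}
```

Because each record is read as four lines, a quality line that happens to start with @ is not mistaken for a header. Characters outside 33..126 (a stray carriage return, say) are left untouched by the substitution.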