I’m trying to get the mean length of FASTA sequences using Erlang. A FASTA file looks like this:
>title1
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCGATCATATA
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCTCGTACGC
>title2
ATCGATCGCATCGATGCTACGATCTCGTACGC
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCGATCATATA
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
>title3
ATCGATCGCATCGAT(...)
I tried to answer this question using the following Erlang code:
-module(golf).
-export([test/0]).

%% Fold one input line into the {Sequences, Total} accumulator.
line([], {Sequences, Total}) -> {Sequences, Total};
line(">" ++ _Rest, {Sequences, Total}) -> {Sequences + 1, Total};
line(L, {Sequences, Total}) -> {Sequences, Total + string:len(string:strip(L))}.

%% Read lines from S until eof, accumulating the counts.
scanLines(S, Sequences, Total) ->
    case io:get_line(S, '') of
        eof -> {Sequences, Total};
        {error, _} -> {Sequences, Total};
        Line ->
            {S2, T2} = line(Line, {Sequences, Total}),
            scanLines(S, S2, T2)
    end.

test() ->
    {Sequences, Total} = scanLines(standard_io, 0, 0),
    io:format("~p\n", [Total / (1.0 * Sequences)]),
    halt().
Compilation/Execution:
erlc golf.erl
erl -noshell -s golf test < sequence.fasta
563.16
This code seems to work fine for a small FASTA file, but it takes hours to parse a larger one (>100 MB). Why? I’m an Erlang newbie; can you please improve this code?
If you need really fast IO, then you have to do a little bit more trickery than usual.
Reading standard input through a port is the fastest IO I know of, but note that it needs -noshell -noinput.
Compile just like erlc +native +"{hipe, [o3]}" g.erl but with -smp disable, and run:
With -smp enable but native it takes:
Byte code but with -smp disable (almost on par with native because most of the work is done in the port!):
Just for completeness, byte code with smp:
For comparison, sarnold’s version gives me the wrong answer and takes longer on the same hardware:
EDIT: I have looked at the characteristics of uniprot_sprot.fasta and I’m a little bit surprised. It has 3824397 rows and is 232MB. That means the -smp disable version can handle 1.18 million text lines per second (71MB/s in line-oriented IO).
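For readers who want to try the port approach mentioned above, here is a minimal sketch of what port-based line reading might look like. It is not necessarily the code that was timed in this answer. The module name fasta_port and the functions main/0, step/2 and mean/1 are hypothetical, the 1024-byte line buffer is an arbitrary choice, and Unix line endings are assumed. Lines longer than the buffer arrive in {noeol, Chunk} pieces, so the state carries a small mode flag to tell a header continuation from a sequence continuation.

```erlang
-module(fasta_port).   % hypothetical module name, for illustration only
-export([main/0, step/2, mean/1]).

%% Entry point. Reads FASTA data from stdin and prints the mean length.
main() ->
    %% Open stdin (fd 0) as a port. Each line is delivered as a binary,
    %% and the eof option makes the port send {Port, eof} at end of input.
    Port = open_port({fd, 0, 1}, [in, binary, eof, {line, 1024}]),
    State = loop(Port, {0, 0, start}),
    io:format("~p~n", [mean(State)]),
    halt().

%% Receive port messages until eof, folding each one into the state.
loop(Port, State) ->
    receive
        {Port, {data, Event}} -> loop(Port, step(Event, State));
        {Port, eof}           -> State
    end.

%% State is {Sequences, TotalBytes, Mode}, where Mode is:
%%   start  - the next chunk begins a new line
%%   header - we are inside a header line longer than the buffer
%%   seq    - we are inside a sequence line longer than the buffer
step({eol,   <<">", _/binary>>}, {S, T, start})  -> {S + 1, T, start};
step({noeol, <<">", _/binary>>}, {S, T, start})  -> {S + 1, T, header};
step({eol,   _},                 {S, T, header}) -> {S, T, start};
step({noeol, _},                 {S, T, header}) -> {S, T, header};
step({eol,   Bin},               {S, T, _})      -> {S, T + byte_size(Bin), start};
step({noeol, Bin},               {S, T, _})      -> {S, T + byte_size(Bin), seq}.

%% Mean sequence length; 0.0 if no header line was seen.
mean({0, _, _}) -> 0.0;
mean({S, T, _}) -> T / S.
```

Under these assumptions it could be run as erl -noshell -noinput -s fasta_port main < sequence.fasta. Keeping step/2 free of IO means the counting logic can be checked without a port.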