I am looking for a way to parse a FASTA file in parallel. For those who do not know the FASTA format, here is an example:
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
Lines starting with ‘>’ are header lines; each one contains the identifier of the sequence that follows it.
I suppose you would load the entire file into memory, but after that I am having trouble finding a way to process the data.
The problem is that threads cannot start at an arbitrary position, because they could cut a sequence in half that way.
Does anyone have experience parsing files in parallel when the lines depend on each other? Any idea is appreciated.
This should be easy enough, since the dependence between lines is very simple in this case: let each thread start at an arbitrary position and skip lines until it reaches one that starts with a ‘>’ (i.e. one that starts a new sequence).
To make sure no sequence gets processed twice, keep a set of all sequence IDs that have already been processed (or track line numbers instead if the sequence IDs aren’t unique, but they really should be!).
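Here is a minimal Python sketch of the skip-to-the-next-header idea, assuming the whole file is already in memory as bytes. It is a variant, not exactly the scheme above: instead of a shared set of processed IDs, each worker gets a byte range and owns every record whose header line begins inside that range, reading past the end of its range to finish its last record. That also rules out duplicates and needs no locking. The names parse_chunk and parse_fasta_parallel are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_chunk(data: bytes, start: int, end: int) -> list[tuple[str, str]]:
    """Parse every FASTA record whose header line begins in [start, end)."""
    # Locate the first header at or after `start`. A header is a '>' at the
    # very beginning of the data or directly after a newline.
    if start == 0 and data.startswith(b">"):
        pos = 0
    else:
        # Search from start - 1 so a header beginning exactly at `start` is found.
        hit = data.find(b"\n>", max(start - 1, 0))
        pos = hit + 1 if hit != -1 else -1

    records = []
    while pos != -1 and pos < end:
        nxt = data.find(b"\n>", pos)            # newline before the next header
        stop = len(data) if nxt == -1 else nxt  # may lie beyond `end`
        header, _, body = data[pos:stop].partition(b"\n")
        seq_id = header[1:].decode().strip()
        sequence = b"".join(body.split()).decode()  # drop line breaks
        records.append((seq_id, sequence))
        pos = -1 if nxt == -1 else nxt + 1
    return records


def parse_fasta_parallel(data: bytes, workers: int = 4) -> list[tuple[str, str]]:
    """Split `data` into roughly equal byte ranges and parse them concurrently."""
    size = len(data)
    step = max(1, -(-size // workers))  # ceiling division
    ranges = [(s, min(s + step, size)) for s in range(0, size, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda r: parse_chunk(data, *r), ranges)
        # map() yields results in range order, so file order is preserved.
        return [rec for part in parts for rec in part]
```

Calling parse_fasta_parallel(open(path, "rb").read()) on a file returns a list of (id, sequence) pairs in file order. Note that in standard CPython the global interpreter lock means threads are unlikely to speed up this pure-Python parsing; the sketch is meant to show the splitting logic, which carries over to processes or to a language with real thread parallelism.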