I have a FASTA file containing several protein sequences. The format is like ———————-

Question

Asked: May 15, 2026 (2026-05-15T10:29:48+00:00)

I have a FASTA file containing several protein sequences. The format is like

----------------------
>protein1
MYRALRLLARSRPLVRAPAAALASAPGLGGAAVPSFWPPNAAR
MASQNSFRIEYDTFGELKVPNDKYYGAQTVRSTMNFKIGGVTE
RMPTPVIKAFGILKRAAAEVNQDYGLDPKIANAIMKAADEVAE
GKLNDHFPLVVWQTGSGTQTNMNVNEVISNRAIEMLGGELGSK
IPVHPNDHVNKSQ

>protein2
MRSRPAGPALLLLLLFLGAAESVRRAQPPRRYTPDWPSLDSRP
LPAWFDEAKFGVFIHWGVFSVPAWGSEWFWWHWQGEGRPYQRF
MRDNYPPGFSYADFGPQFTARFFHPEEWADLFQAAGAKYVVLT
TKHHEGFTNW*

>protein3
MKTLLLLAVIMIFGLLQAHGNLVNFHRMIKLTTGKEAALSYGF
CHCGVGGRGSPKDATDRCCVTHDCCYKRLEKRGCGTKFLSYKF
SNSGSRITCAKQDSCRSQLCECDKAAATCFARNKTTY

-----------------------------------

Is there a good way to read in this file and store the sequences separately?

Thanks

1 Answer

Editorial Team · Answer 1 · 2026-05-15T10:29:49+00:00

A little more detail about the exact file structure would help. From what you have posted (and a quick look at the samples on Wikipedia), each protein name is prefixed with a > and followed by at least one line break, so that is a good place to start.

You could split the file on newlines and look for a leading > character to identify each name.

From there it is a little less clear, because I'm not sure whether the sequence data is always on one line or can contain line breaks. If it is always on one line, you can store that line as the sequence and move on to the next protein name. If it wraps across several lines, as it appears to in your sample, each line needs to be appended to the current protein's sequence until the next > line. Something like this:

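// Requires: using System.IO;
// StoreProteinName and StoreSequence are placeholders for your own storage logic.
using (var reader = new StreamReader(@"C:\myfile.fasta"))   // verbatim string, so the backslash is not an escape
{
    string line;
    // ReadLine returns null only at end of file, so blank lines between records do not stop the loop.
    while ((line = reader.ReadLine()) != null)
    {
        if (string.IsNullOrWhiteSpace(line))
            continue;                 // skip blank separator lines
        if (line.StartsWith(">"))
            StoreProteinName(line);   // header line, still including the leading '>'
        else
            StoreSequence(line);      // called once per sequence line
    }
}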
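
If the sequences do wrap across lines, as they appear to in the question's sample, the loop above can be extended to concatenate lines until the next header. The following is only a minimal sketch under some assumptions: the class name FastaSketch and its Parse method are hypothetical, the dashed lines in the question are taken to be delimiters of the post rather than part of the file, and a trailing * is kept as part of the sequence rather than stripped.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of any library.
public static class FastaSketch
{
    // Reads FASTA records from any TextReader and returns (name, sequence)
    // pairs in file order. Sequence lines are concatenated until the next
    // '>' header line.
    public static List<KeyValuePair<string, string>> Parse(TextReader reader)
    {
        var records = new List<KeyValuePair<string, string>>();
        string name = null;
        var sequence = new StringBuilder();
        string line;

        while ((line = reader.ReadLine()) != null)
        {
            line = line.Trim();
            if (line.Length == 0)
                continue;                  // blank lines between records are skipped

            if (line[0] == '>')
            {
                // A new header closes the previous record, if there was one.
                if (name != null)
                    records.Add(new KeyValuePair<string, string>(name, sequence.ToString()));
                name = line.Substring(1).Trim();
                sequence.Clear();
            }
            else if (name != null)
            {
                sequence.Append(line);     // text before the first header is ignored
            }
        }

        // The last record has no following header, so it is flushed here.
        if (name != null)
            records.Add(new KeyValuePair<string, string>(name, sequence.ToString()));

        return records;
    }
}
```

Passing a StreamReader opened on the file to FastaSketch.Parse should give one entry per protein, with the name in Key and the joined sequence in Value. A list of pairs is used here instead of a dictionary so that duplicate names do not throw; whether that matters depends on your data.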

If it were me, I would probably use TDD and some sample data to build out a simple parser, and then keep plugging in samples until I felt I had covered all of the major variations in the format.
