I have a dataset with FASTA formatted sequencing, basically like this: >pc284 ATCGCGACTCGAC >pc293

Question

Asked: June 11, 2026 (2026-06-11T20:51:36+00:00)

I have a dataset with FASTA formatted sequencing, basically like this:

>pc284  
ATCGCGACTCGAC

>pc293  
ACCCGACCTCAGC

I want to use each tag as a key in a dictionary and store the gene as its value.

This is the code I have, but it isn't really doing anything:

import re
fileData = open('d.fasta', 'r')

myDict = dict()

for line in fileData:
  match = re.search('(\>)(\w+)(\r)(\w+)', line)
  if match: 
    gene = match.group(3)
    myDict[gene[0]] = gene[1]

print myDict

1 Answer

Editorial Team · Answer 1 · 2026-06-11T20:51:36+00:00

\r is not a character class; it matches only a literal carriage return. I think you meant to use \s instead. You can also drop the capturing groups you don't use.

But most of all, you need to extract your groups correctly:

match = re.search(r'>(\w+)\s+(\w+)', line)
if match:
    tag, gene = match.groups()
    myDict[tag] = gene

By creating only two capturing groups, we can more simply extract those two with .groups() and directly assign them to two variables, tag and gene.
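As a quick illustration of that unpacking, here is a minimal sketch run against a single string. The sample string is made up for the example and puts the tag and the gene on one line, which is the only layout this pattern can match when applied to one line at a time.

```python
import re

# Hypothetical one-line sample: tag and gene separated by whitespace
sample = ">pc284 ATCGCGACTCGAC"

match = re.search(r'>(\w+)\s+(\w+)', sample)
if match:
    # .groups() returns one string per capturing group, in order
    tag, gene = match.groups()
    print(tag, gene)
```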

However, reading up on the FASTA format suggests it is a multi-line format, with the tag on one line and the gene data on one or more lines after it. In that case your \r was presumably meant to match the newline. That cannot work while you read the file one line at a time, because the tag and the gene data never appear in the same line.
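To make the difference concrete, here is a small sketch that applies the same pattern first to each line and then to the whole text. The text variable is an in-memory stand-in for the file contents, not something from the original code. Note that matching against the whole text only captures the first line of each sequence, so it is not a complete FASTA reader.

```python
import re

# In-memory stand-in for the file contents (assumed layout: tag line, then gene line)
text = ">pc284\nATCGCGACTCGAC\n\n>pc293\nACCCGACCTCAGC\n"
pattern = re.compile(r'>(\w+)\s+(\w+)')

# Line by line: no single line contains both the tag and the gene, so nothing matches
per_line = [pattern.search(line) for line in text.splitlines()]

# Whole text: \s+ can now span the newline between the tag and the gene
whole = dict(pattern.findall(text))

print(per_line)
print(whole)
```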

It would be much simpler to read that format without regular expressions like so:

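myDict = {}

# Text mode in Python 3 already translates \n, \r\n and \r line endings
with open('d.fasta') as fileData:
    tag = None
    for line in fileData:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line[0] == '>':
            # header line: everything after '>' is the tag
            tag = line[1:]
            myDict[tag] = ''
        else:
            assert tag is not None, 'Invalid format, found gene without tag'
            # sequence line: append to the most recently read tag
            myDict[tag] += line

print(myDict)

This reads the file line by line, detecting tags by the leading > character, then collects the following lines of gene information into your dictionary under the most recently read tag.

Note the file mode. In Python 2 you would open the file with 'rU' to get universal newlines mode, which handles whatever newline convention was used to create the file. Python 3 does this by default for text files, and the 'U' flag was removed in Python 3.11, so the plain open() call above is enough.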
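If you want to try the line-by-line approach without a file on disk, one option is to wrap it in a function that accepts any iterable of lines. This is only a sketch: the name parse_fasta and the sample data are made up for the example, it collects the pieces in lists and joins them at the end, and it raises ValueError rather than using assert.

```python
import io

def parse_fasta(lines):
    """Collect FASTA sequences from an iterable of lines into a dict keyed by tag."""
    chunks = {}
    tag = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('>'):
            tag = line[1:]
            chunks[tag] = []
        elif tag is None:
            raise ValueError('Invalid format, found gene data before any tag')
        else:
            chunks[tag].append(line)
    # Join the collected pieces once per tag
    return {name: ''.join(parts) for name, parts in chunks.items()}

# Hypothetical sample with one sequence split across two lines
sample = io.StringIO(">pc284  \nATCGCG\nACTCGAC\n\n>pc293  \nACCCGACCTCAGC\n")
result = parse_fasta(sample)
print(result)
```

An open file object can be passed in the same way, since it is also an iterable of lines.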

Last but not least, take a look at the Biopython project; its Bio.SeqIO module handles FASTA plus many other formats.
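A minimal sketch of that route, assuming Biopython is installed (it is a third-party package). The helper name fasta_to_dict is made up for the example, and the import sits inside the function only so the snippet can be loaded on a machine without Biopython. SeqIO.parse yields one record per entry; record.id is the first word of the header line and record.seq holds the sequence.

```python
def fasta_to_dict(handle):
    """Return {tag: sequence string} for a FASTA file handle or filename."""
    # Third-party dependency, imported here so the module loads without it
    from Bio import SeqIO

    return {record.id: str(record.seq) for record in SeqIO.parse(handle, "fasta")}

# Usage, assuming d.fasta exists in the current directory:
# genes = fasta_to_dict("d.fasta")
# print(genes)
```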
