I am new to using regular expression in python. I am having trouble figuring

Editorial Team

Asked: June 14, 2026

I am new to using regular expressions in Python. I am having trouble figuring out how to do the following:

I have a bunch of text descriptions stored as strings that look like this:

FX0XST001ALF89  OLIGO: Bacillus_cand1=ATGCGGTTCAAAATGTTATC      
FILE:/home/AAFC-AAC/fungs/biodiversity/pipelines/454PipelineOutput/v7_newest_testrun_full/rs75/plate1/FX0XST001.MID13/FX0XST001.MID13.sff.trim.fasta    
Project: SAGES  SFF: FX0XST001  SFF.MID: FX0XST001.MID13    
Plate: 1.1     MID_all: MID13   MID: 13 Sample: BK104   
Collector: BK   Year: 2008  Week:   Year_Week:  
Location: Ottawa_ON     City: Ottawa    Province: ON    Crop:   
Treatment:    Substrate_all: Air    Substrate: Air  Target: Bacteria    
Forward Primer: Bac16S27F   Reverse Primer: Bac16S690R  Taq: T

I want to be able to extract the categories from this large string and store them in a database or something similar, for example:

Year: 2008
Sample: BK104
Collector: BK

etc...

How can I use regular expressions in Python to achieve this?

I am thinking of using search:

match = re.search(r'Sample:\w\w\w\w\w', theTextDescription)

The problem is that the length of the text in each 'field' is different, and I don't know how to take that into account.

1 Answer

Editorial Team · Answer 1 · 2026-06-14T15:28:40+00:00

Something like this should work. You can use \w+ to match one or more word characters, so the length of each field does not matter:

In [37]: strs
Out[37]: 'FX0XST001ALF89  OLIGO: Bacillus_cand1=ATGCGGTTCAAAATGTTATC      \nFILE:/home/AAFC-AAC/fungs/biodiversity/pipelines/454PipelineOutput/v7_newest_testrun_full/rs75/plate1/FX0XST001.MID13/FX0XST001.MID13.sff.trim.fasta    \nProject: SAGES  SFF: FX0XST001  SFF.MID: FX0XST001.MID13    \nPlate: 1.1     MID_all: MID13   MID: 13 Sample: BK104   \nCollector: BK   Year: 2008  Week:   Year_Week:  \nLocation: Ottawa_ON     City: Ottawa    Province: ON    Crop:   \nTreatment:    Substrate_all: Air    Substrate: Air  Target: Bacteria    \nForward Primer: Bac16S27F   Reverse Primer: Bac16S690R  Taq: T'

In [38]: re.findall(r"\w+:\s\w+",strs)
Out[38]: 
['OLIGO: Bacillus_cand1',
 'Project: SAGES',
 'SFF: FX0XST001',
 'MID: FX0XST001',
 'Plate: 1',
 'MID_all: MID13',
 'MID: 13',
 'Sample: BK104',
 'Collector: BK',
 'Year: 2008',
 'Location: Ottawa_ON',
 'City: Ottawa',
 'Province: ON',
 'Substrate_all: Air',
 'Substrate: Air',
 'Target: Bacteria',
 'Primer: Bac16S27F',
 'Primer: Bac16S690R',
 'Taq: T']
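
For the single-field search attempted in the question, a minimal sketch might look like the following. The helper name get_field is hypothetical (it is not from the original post), and the sketch assumes that values contain no whitespace or colons. It uses a capture group instead of a fixed number of \w, and a negative lookbehind so that 'MID' is not found inside 'SFF.MID'. Because the value pattern also accepts dots, 'Plate: 1.1' is not cut down to 'Plate: 1' as in the output above.

```python
import re

# Three lines taken from the sample description in the question.
text = (
    "Project: SAGES  SFF: FX0XST001  SFF.MID: FX0XST001.MID13    \n"
    "Plate: 1.1     MID_all: MID13   MID: 13 Sample: BK104   \n"
    "Collector: BK   Year: 2008  Week:   Year_Week:  "
)

def get_field(name, description):
    """Return the value after 'name:', or None if the field is absent or empty.

    Hypothetical helper; assumes values contain no whitespace or colons.
    """
    pattern = (
        r"(?<![\w.])"          # name must not be the tail of a longer key
        + re.escape(name)
        + r":[ \t]*"           # colon, then optional spaces or tabs
        + r"([^\s:]+)(?=\s|$)"  # value: a token that is not itself a key
    )
    match = re.search(pattern, description)
    return match.group(1) if match else None

print(get_field("Sample", text))  # BK104
print(get_field("Plate", text))   # 1.1
print(get_field("MID", text))     # 13
print(get_field("Week", text))    # None (the field is empty)
```

The lookahead at the end of the value is what keeps an empty field such as 'Week:' from swallowing the next key ('Year_Week:') as its value.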

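Or you could store the pairs in a dictionary (note that each value keeps its leading space, and repeated keys such as 'MID' and 'Primer' keep only the last value):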

In [39]: dict(x.split(":") for x in  re.findall(r"\w+:\s\w+",strs))
Out[39]: 
{'City': ' Ottawa',
 'Collector': ' BK',
 'Location': ' Ottawa_ON',
 'MID': ' 13',
 'MID_all': ' MID13',
 'OLIGO': ' Bacillus_cand1',
 'Plate': ' 1',
 'Primer': ' Bac16S690R',
 'Project': ' SAGES',
 'Province': ' ON',
 'SFF': ' FX0XST001',
 'Sample': ' BK104',
 'Substrate': ' Air',
 'Substrate_all': ' Air',
 'Taq': ' T',
 'Target': ' Bacteria',
 'Year': ' 2008'}
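
The dictionary above merges 'SFF.MID' into 'MID' and both primers into 'Primer', and it skips empty fields. One possible variation is sketched below. It assumes that keys are one or two words joined by a single space, that values contain no whitespace or colons, and that any stray token (such as the leading ID) is separated from the next key by more than one space. The names FIELD and parse_fields are hypothetical, introduced only for this illustration.

```python
import re

# The sample description from the question.
text = (
    "FX0XST001ALF89  OLIGO: Bacillus_cand1=ATGCGGTTCAAAATGTTATC      \n"
    "FILE:/home/AAFC-AAC/fungs/biodiversity/pipelines/454PipelineOutput/"
    "v7_newest_testrun_full/rs75/plate1/FX0XST001.MID13/"
    "FX0XST001.MID13.sff.trim.fasta    \n"
    "Project: SAGES  SFF: FX0XST001  SFF.MID: FX0XST001.MID13    \n"
    "Plate: 1.1     MID_all: MID13   MID: 13 Sample: BK104   \n"
    "Collector: BK   Year: 2008  Week:   Year_Week:  \n"
    "Location: Ottawa_ON     City: Ottawa    Province: ON    Crop:   \n"
    "Treatment:    Substrate_all: Air    Substrate: Air  Target: Bacteria    \n"
    "Forward Primer: Bac16S27F   Reverse Primer: Bac16S690R  Taq: T"
)

# Key: word characters or dots, optionally followed by one more word
# (for example "Forward Primer").
# Value: a possibly empty run of non-space, non-colon characters that must
# end at whitespace or at the end of the string, so a following key is
# never mistaken for a value.
FIELD = re.compile(r"([\w.]+(?: \w+)?):[ \t]*([^\s:]*)(?=\s|$)")

def parse_fields(description):
    """Return a dict mapping field name to value (None for empty fields)."""
    return {key: (value or None) for key, value in FIELD.findall(description)}

fields = parse_fields(text)
print(fields["Year"])            # 2008
print(fields["Plate"])           # 1.1
print(fields["Forward Primer"])  # Bac16S27F
print(fields["Week"])            # None
```

Since the values are already stripped, the resulting dictionary can be passed more or less directly to whatever storage layer is used.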
