I’m working with regular expressions for the first time, and I need to extract some data from this report (a txt file with formatting info):
\n10: Vikelis M, Rapoport AM. Role of
antiepileptic drugs as preventive
agents for \nmigraine. CNS Drugs. 2010
Jan 1;24(1):21-33.
doi:\n10.2165/11310970-000000000-00000.
Review. PubMed PMID:
20030417.\n\n\n21: Johannessen Landmark C, Larsson PG, Rytter E,
Johannessen SI. Antiepileptic\ndrugs
in epilepsy and other disorders–a
population-based study of
prescriptions.\nEpilepsy Res. 2009
Nov;87(1):31-9. Epub 2009 Aug 13.
PubMed PMID: 19679449.\n\n\n
As you can see, every record in the text begins with a number like “xx:” and always ends with “PubMed PMID: dddddddd.”, but using a regex like this:
regex = re.compile(r"^\d+: .+ PMID: \d{8}.$")
regex.findall(inputfile)
gives me a list with one big string, so I’m misunderstanding something. How can I extract data from these records?
Use .+? for non-greedy matching instead of .+ which gives you greedy matching. You also want re.DOTALL to make sure your “.” matches the line-end characters it needs to match, and re.MULTILINE to make sure the ^ and $ match starts and ends of lines, not just of the whole string. The options in question need to be joined with the “bit-OR” operator | and passed as the second argument to re.compile.
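Putting those pieces together, a minimal sketch might look like the following. It assumes the report has already been read into one string with real newline characters; the names report and record_re are only illustrative, and the sample text is retyped from the question. Escaping the final dot as \. is an extra tweak not mentioned above, so that it matches only a literal period.

```python
import re

# Sample text retyped from the question; assumed to contain real newlines.
report = (
    "\n10: Vikelis M, Rapoport AM. Role of antiepileptic drugs as preventive "
    "agents for \nmigraine. CNS Drugs. 2010 Jan 1;24(1):21-33. doi:\n"
    "10.2165/11310970-000000000-00000. Review. PubMed PMID: 20030417.\n\n\n"
    "21: Johannessen Landmark C, Larsson PG, Rytter E, Johannessen SI. "
    "Antiepileptic\ndrugs in epilepsy and other disorders–a population-based "
    "study of prescriptions.\nEpilepsy Res. 2009 Nov;87(1):31-9. Epub 2009 "
    "Aug 13. PubMed PMID: 19679449.\n\n\n"
)

# .+? is non-greedy, so each match stops at the first "PMID: dddddddd." it finds.
# re.DOTALL lets "." cross line breaks inside a record.
# re.MULTILINE lets ^ and $ anchor at the start and end of every line.
record_re = re.compile(r"^\d+: .+? PMID: \d{8}\.$", re.DOTALL | re.MULTILINE)

records = record_re.findall(report)

for record in records:
    print(repr(record))
```

With the sample above, findall should return one string per record instead of a single big string. If the file on disk really contains a backslash followed by the letter n rather than actual line breaks, the pattern would need adjusting.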