I have some data that looks like this:
PMID- 19587274
OWN - NLM
DP - 2009 Jul 8
TI - Domain general mechanisms of perceptual decision making in human cortex.
PG - 8675-87
AB - To successfully interact with objects in the environment, sensory evidence must
be continuously acquired, interpreted, and used to guide appropriate motor
responses. For example, when driving, a red
AD - Perception and Cognition Laboratory, Department of Psychology, University of
California, San Diego, La Jolla, California 92093, USA.
PMID- 19583148
OWN - NLM
DP - 2009 Jun
TI - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic
amyloidosis.
PG - 482-6
AB - BACKGROUND: Amyloidosis represents a group of different diseases characterized by
extracellular accumulation of pathologic fibrillar proteins in various tissues
AD - Asklepios Hospital, Department of Medicine, Langen, Germany.
innere2.longen@asklepios.com
I want to write a regex that matches the text following PMID, TI and AB.
Is it possible to get all of these with a single regex?
I have spent nearly the whole day trying to figure out a regex, and the closest I could get is this:
reg4 = r'PMID- (?P<pmid>[0-9]*).*TI.*- (?P<title>.*)PG.*AB.*- (?P<abstract>.*)AD'
for i in re.finditer(reg4, data, re.S | re.M):
    print(i.groupdict())
This returns a match only for the second “set” of data, not for all of them.
Any ideas? Thank you!
How about:
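A minimal sketch of one way to do this, not necessarily the exact snippet originally posted: an alternation anchored at the start of a line, where each branch captures one field. The sample data assumes the usual MEDLINE layout (tags padded to four characters, continuation lines indented), and the `\s*` after each tag is an added tolerance for padding differences.

```python
import re

# Sample records; the tag padding and the indentation of continuation
# lines are assumptions about the real file layout.
data = """PMID- 19587274
OWN - NLM
DP  - 2009 Jul 8
TI  - Domain general mechanisms of perceptual decision making in human cortex.
PG  - 8675-87
AB  - To successfully interact with objects in the environment, sensory evidence must
      be continuously acquired, interpreted, and used to guide appropriate motor
      responses. For example, when driving, a red
AD  - Perception and Cognition Laboratory, Department of Psychology, University of
      California, San Diego, La Jolla, California 92093, USA.
PMID- 19583148
OWN - NLM
DP  - 2009 Jun
TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic
      amyloidosis.
PG  - 482-6
AB  - BACKGROUND: Amyloidosis represents a group of different diseases characterized by
      extracellular accumulation of pathologic fibrillar proteins in various tissues
AD  - Asklepios Hospital, Department of Medicine, Langen, Germany.
      innere2.longen@asklepios.com
"""

# Each branch starts at the beginning of a line (re.MULTILINE) and the
# lazy .*? may span several lines (re.DOTALL) until the next tag.
pattern = re.compile(
    r"^(?:PMID- (?P<pmid>[0-9]+)"
    r"|TI\s*- (?P<title>.*?)^PG"
    r"|AB\s*- (?P<abstract>.*?)^AD)",
    re.MULTILINE | re.DOTALL,
)

matches = [m.groupdict() for m in pattern.finditer(data)]
for d in matches:
    print(d)
```

Each match fills only one of the three groups and leaves the other two as None, so the two sample records produce six dictionaries.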
Output:
Edit
As a verbose RE to make it more understandable (I think verbose REs should be used for anything but the simplest of expressions, but that’s just my opinion!):
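A sketch of what such a verbose pattern could look like, assuming the same alternation approach and a shortened, MEDLINE-style sample (the padding and indentation in `data` are assumptions). With `re.VERBOSE`, unescaped whitespace in the pattern is ignored, so the spaces around the dash are written as `\s*`.

```python
import re

# Shortened sample; layout details are assumptions about the real file.
data = """PMID- 19587274
OWN - NLM
TI  - Domain general mechanisms of perceptual decision making in human cortex.
PG  - 8675-87
AB  - To successfully interact with objects in the environment, sensory evidence must
      be continuously acquired, interpreted, and used to guide appropriate motor
      responses. For example, when driving, a red
AD  - Perception and Cognition Laboratory, Department of Psychology, University of
      California, San Diego, La Jolla, California 92093, USA.
PMID- 19583148
OWN - NLM
TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic
      amyloidosis.
PG  - 482-6
AB  - BACKGROUND: Amyloidosis represents a group of different diseases characterized by
      extracellular accumulation of pathologic fibrillar proteins in various tissues
AD  - Asklepios Hospital, Department of Medicine, Langen, Germany.
"""

pattern = re.compile(r"""
    ^                                   # start of a line (needs re.MULTILINE)
    (?:
        PMID-\s*(?P<pmid>[0-9]+)        # PMID tag, then the numeric identifier
      |
        TI\s*-\s*(?P<title>.*?)^PG      # title: up to a line that starts with PG
      |
        AB\s*-\s*(?P<abstract>.*?)^AD   # abstract: up to a line that starts with AD
    )
""", re.MULTILINE | re.DOTALL | re.VERBOSE)

matches = [m.groupdict() for m in pattern.finditer(data)]
for d in matches:
    print(d)
```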
Note that you could replace the ^PG and ^AD with ^\S to make it more general (you want to match everything up until the first non-space at the start of a line).
Edit 2
If you want to catch the whole thing in one regexp, get rid of the starting (?:, the ending ) and change the | characters to .*?:
This gives:
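A minimal sketch of a single-pattern version, under the same assumptions about the data layout as before (padded tags, indented continuation lines). Anchoring TI and AB with `^` and collapsing whitespace in the captured text are additions made here for robustness; the lazy `.*?` is what keeps each match inside one record, whereas the greedy `.*` in the question runs to the last record.

```python
import re

# Shortened sample; layout details are assumptions about the real file.
data = """PMID- 19587274
OWN - NLM
TI  - Domain general mechanisms of perceptual decision making in human cortex.
PG  - 8675-87
AB  - To successfully interact with objects in the environment, sensory evidence must
      be continuously acquired, interpreted, and used to guide appropriate motor
      responses. For example, when driving, a red
AD  - Perception and Cognition Laboratory, Department of Psychology, University of
      California, San Diego, La Jolla, California 92093, USA.
PMID- 19583148
OWN - NLM
TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic
      amyloidosis.
PG  - 482-6
AB  - BACKGROUND: Amyloidosis represents a group of different diseases characterized by
      extracellular accumulation of pathologic fibrillar proteins in various tissues
AD  - Asklepios Hospital, Department of Medicine, Langen, Germany.
"""

# One match per record: lazy .*? skips the lines between the wanted fields.
pattern = re.compile(
    r"^PMID- (?P<pmid>[0-9]+)"
    r".*?^TI\s*- (?P<title>.*?)^PG"
    r".*?^AB\s*- (?P<abstract>.*?)^AD",
    re.MULTILINE | re.DOTALL,
)

records = []
for m in pattern.finditer(data):
    # Collapse line breaks and indentation inside multi-line fields.
    records.append({key: " ".join(value.split())
                    for key, value in m.groupdict().items()})

for record in records:
    print(record)
```

Here every dictionary holds all three fields for one record, so the two sample records produce two dictionaries.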