I am trying to extract all occurrences of tagged words from a string using regex in Python 2.7.2. Put simply, I want to extract every piece of text inside the [P] and [/P] tags.
Here is my attempt:
import re
regex = ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(regex, line)
Printing person produces ['President [P]', '[/P]', '[P] Bill Gates [/P]']
What is the correct regex to get: ['[P] Barack Obama [/P]', '[P] Bill Gates [/P]']
or ['Barack Obama', 'Bill Gates']?
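With the pattern corrected to ur"\[P\] (.+?) \[/P\]", re.findall on the same line yields ['Barack Obama', 'Bill Gates'].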
The regex
ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?" is exactly the same unicode as
u'[[1P].+?[/P]]+?' except harder to read. The first bracketed group
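[[1P] tells re that any of the characters in the list ['[', '1', 'P'] should match, and similarly with the second bracketed group [/P]]. That’s not what you want at all. So,
- Remove the outer enclosing square brackets. (Also remove the stray 1 in front of P.)
- To match the literal brackets in [P], escape the brackets with a backslash: \[P\].
- To return only the words inside the tags, place grouping parentheses around .+?.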
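Putting those fixes together, here is a minimal sketch of both results the question asks for. It assumes plain raw strings (r"...") rather than the ur"..." prefix, since the pattern contains only ASCII and Python 3 rejects ur; the \s* around the group is an optional extra, not part of the fixes above, that trims the padding spaces inside the tags. The names `names` and `tagged` are just illustrative.

```python
import re

line = ("President [P] Barack Obama [/P] met Microsoft founder "
        "[P] Bill Gates [/P], yesterday.")

# Escaped brackets match the literal tags. The parentheses capture only the
# text between them, and \s* (an assumption, see above) drops padding spaces.
names = re.findall(r"\[P\]\s*(.+?)\s*\[/P\]", line)

# Without a capturing group, findall returns each whole match, tags included.
tagged = re.findall(r"\[P\].+?\[/P\]", line)

print(names)   # ['Barack Obama', 'Bill Gates']
print(tagged)  # ['[P] Barack Obama [/P]', '[P] Bill Gates [/P]']
```

The only difference between the two calls is the capturing group: when a pattern has exactly one group, re.findall returns the group's text instead of the full match.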