I want to change this string
<p><b> hello world </b></p>. I am playing <b> python </b>
to:
<bold><bold>hello world </bold></bold>, I am playing <bold> python </bold>
I used:
import re
pattern = re.compile(r'\<p>(.*?)\</p>|\<b>(.*?)\</b>')
print re.sub(pattern, r'<bold>\1</bold>', "<p><b>hello world</b></p>. I am playing <b> python</b>")
It does not output what I want; it fails with error: unmatched group
It works in this case:
re.sub(pattern, r'<bold>\1</bold>', "<p>hello world</p>. I am playing <p> python</p>")
<bold> hello world </bold>. I am playing <bold> python</bold>
Although I don’t recommend using Regex for parsing HTML (there are libraries for that purpose in almost every language), this should work:
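I think the problem you’re having comes from how Python handles groups: with an alternation, only the group inside the branch that actually matched is set, so \1 is unmatched whenever the <b> branch is the one that matches.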
Test the following and you’ll see what I mean:
You will see the following:
Also, take into account that it first matched what is between <p></p>, so it took <b> hello world </b> (something you would like to match too) as the first match. Maybe changing the order of the alternatives in pattern would solve this, but then the opposite could happen (having <b><p> ... </p></b>).
I wish I could provide more info, but I’m not very good at regex in Python. C# handles them differently.
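As a rough illustration of the group behaviour described above, the sketch below lists what each match captures and then tries one possible workaround: rewriting the tags themselves instead of capturing their content. The names text, tag_pattern and result are only illustrative, and the tag-renaming pattern is an assumption of mine, not something from the question. It ignores tag attributes and anything beyond bare <p> and <b> tags.

```python
import re

text = "<p><b>hello world</b></p>. I am playing <b> python</b>"
pattern = re.compile(r'<p>(.*?)</p>|<b>(.*?)</b>')

# Each match only fills the group of the alternative that matched;
# the group in the other alternative is None.
groups_per_match = [m.groups() for m in pattern.finditer(text)]
print(groups_per_match)

# Assumed workaround: match each opening or closing tag on its own.
# Group 1 captures the optional "/" and always participates in the match,
# so the replacement never refers to an unmatched group.
tag_pattern = re.compile(r'<(/?)(?:p|b)>')
result = tag_pattern.sub(r'<\1bold>', text)
print(result)
```

The second match has None in group 1, which is what triggers the unmatched group error on older Python versions. From Python 3.5 on, an unmatched group in the replacement is substituted with an empty string instead, so the original call would silently produce an empty <bold></bold> rather than fail.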
Edit:
I understand you might want to do this with regex for learning or testing purposes, but in production code I would go for another alternative (like the one @Senthil gave you) or just use an HTML parser.
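For the HTML parser route, a minimal sketch with the standard library could look like the following. It assumes Python 3, where the module is html.parser (in Python 2 it is called HTMLParser). The BoldRewriter class and its RENAME table are hypothetical names made up for this example. It drops tag attributes and decodes character references, so treat it as a starting point only.

```python
from html.parser import HTMLParser  # Python 3 module name


class BoldRewriter(HTMLParser):
    """Hypothetical helper: re-emits the input with <p> and <b> renamed to <bold>."""

    RENAME = {"p": "bold", "b": "bold"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored in this sketch.
        self.parts.append("<%s>" % self.RENAME.get(tag, tag))

    def handle_endtag(self, tag):
        self.parts.append("</%s>" % self.RENAME.get(tag, tag))

    def handle_data(self, data):
        # Text between tags is copied through unchanged.
        self.parts.append(data)


rewriter = BoldRewriter()
rewriter.feed("<p><b>hello world</b></p>. I am playing <b> python</b>")
rewriter.close()  # flush any text still buffered by the parser
output = "".join(rewriter.parts)
print(output)
```

Because the parser reports each opening and closing tag separately, nesting such as <p><b> ... </b></p> or <b><p> ... </p></b> is handled the same way, without depending on the order of regex alternatives.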