I'd like to know how to fix broken HTML tags before parsing the markup with Beautiful Soup.
In the following script, the td> needs to be replaced with <td>.
How can I do the substitution so Beautiful Soup can see it?
from BeautifulSoup import BeautifulSoup
s = """
<tr>
td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>"""
a = BeautifulSoup(s)
left = []
right = []
for tr in a.findAll('tr'):
    l, r = tr.findAll('td')
    left.extend(l.findAll(text=True))
    right.extend(r.findAll(text=True))
print left + right
Edit (working):
I grabbed a list of all HTML tags from w3 (it should be complete) to match against. Try it out:
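A minimal sketch of that approach, assuming a shortened, hypothetical tag list in place of the full one. The names TAGS, OPEN_RE and fix_opening_tags are made up for illustration. It is only a heuristic: it looks for a known tag name, optionally followed by attributes, then ">", where the name is not preceded by "<", "/" or a word character. Unusual markup (for example a bare attribute named like a tag) can fool it. The repaired string can then be passed to BeautifulSoup as in the question.

```python
import re

# Hypothetical, shortened tag list; the full list of HTML tags would go here.
TAGS = ("table", "tr", "td", "th", "div", "span", "p", "a")

# A known tag name, optional attributes, then ">", but NOT preceded by
# "<", "/" or a word character, so "<td>", "</td>" and "std>" are left alone.
OPEN_RE = re.compile(r"(?<![<\w/])(?:%s)(?:\s[^<>]*)?>" % "|".join(TAGS))


def fix_opening_tags(html):
    """Insert the missing "<" in front of broken opening tags such as "td>"."""
    return OPEN_RE.sub(lambda m: "<" + m.group(0), html)


s = """
<tr>
td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>"""

fixed = fix_opening_tags(s)
print(fixed)
```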
Produces:
This one should match broken ending tags as well (</endtag>):
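A sketch of how broken ending tags could be handled too, assuming "broken" means the leading "<" is missing (for example /td> instead of </td>). The tag list is again a shortened, hypothetical one, and TAGS, OPEN_RE, CLOSE_RE and fix_tags are illustrative names. The same caveat applies: this is a regex heuristic, not a real HTML repair tool.

```python
import re

# Hypothetical, shortened tag list; the full list of HTML tags would go here.
TAGS = ("table", "tr", "td", "th", "div", "span", "p", "a")
NAMES = "|".join(TAGS)

# Opening tag missing its "<": a tag name not preceded by "<", "/" or a word char.
OPEN_RE = re.compile(r"(?<![<\w/])(?:%s)(?:\s[^<>]*)?>" % NAMES)

# Ending tag missing its "<": "/name>" where the "/" is not preceded by "<".
# Text may sit directly in front of it (e.g. "LABEL1/td>"), so only "<" is excluded.
CLOSE_RE = re.compile(r"(?<!<)/(?:%s)\s*>" % NAMES)


def fix_tags(html):
    """Insert the missing "<" in front of broken opening and ending tags."""
    html = CLOSE_RE.sub(lambda m: "<" + m.group(0), html)
    return OPEN_RE.sub(lambda m: "<" + m.group(0), html)


broken = "<tr>\ntd>LABEL1/td><td>INPUT1</td>\n</tr>"
print(fix_tags(broken))
```

Well-formed tags pass through unchanged, so running fix_tags on already valid markup should be harmless for the tags in the list.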