I’m struggling to parse some flaky HTML tables down to lists with Beautiful Soup. The tables in question are missing their closing </td> tags.
Using the following code (not the real tables I’m parsing, but functionally similar):
import bs4

test = "<table> <tr><td>1<td>2<td>3</tr> <tr><td>1<td>2<td>3</tr> </table>"

def walk_table2(text):
    """Take an HTML table and spit out a list of lists (of entries in a row)."""
    # No parser is named here, so Beautiful Soup picks one itself.
    soup = bs4.BeautifulSoup(text)
    return [[x for x in row.findAll('td')] for row in soup.findAll('tr')]

print(walk_table2(test))
Gives me:
[[<td>1<td>2<td>3</td></td></td>, <td>2<td>3</td></td>, <td>3</td>], [<td>1<td>2<td>3</td></td></td>, <td>2<td>3</td></td>, <td>3</td>]]
Rather than the expected:
[[<td>1</td>, <td>2</td>, <td>3</td>], [<td>1</td>, <td>2</td>, <td>3</td>]]
It seems that the lxml parser that Beautiful Soup is using decides to add the </td> tag before the next instance of </tr> rather than the next instance of <td>.
At this point, I’m wondering if there is a good option to make the parser place the ending td tags in the correct location, or if it would be easier to use a regular expression to place them manually before tossing the string into BeautifulSoup… Any thoughts? Thanks in advance!
You’re seeing decisions made by Python’s built-in HTML parser. If you don’t like the way that parser does things, you can tell Beautiful Soup to use a different parser. The html5lib parser and the lxml parser both give the result you want:
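A minimal sketch of that switch, assuming the html5lib and lxml packages are installed alongside bs4. The names walk_table, TEST_HTML and results are illustrative and not from the question, and the sketch returns each cell's text rather than the tag objects to keep the output easy to compare. The parser is chosen by passing its name as the second argument to BeautifulSoup:

```python
import bs4

# Same shape as the table in the question: every <td> is left unclosed.
TEST_HTML = "<table> <tr><td>1<td>2<td>3</tr> <tr><td>1<td>2<td>3</tr> </table>"


def walk_table(text, parser):
    """Return a list of rows, each a list of the cell texts in that row."""
    # Naming the parser explicitly stops Beautiful Soup from choosing one itself.
    soup = bs4.BeautifulSoup(text, parser)
    return [
        [cell.get_text(strip=True) for cell in row.find_all("td")]
        for row in soup.find_all("tr")
    ]


results = {}
for parser in ("html5lib", "lxml"):
    try:
        results[parser] = walk_table(TEST_HTML, parser)
    except bs4.FeatureNotFound:
        # The parser package is not installed in this environment; skip it.
        continue
    print(parser, results[parser])
```

If neither package is available, the loop simply prints nothing; installing either one (for example with pip) should be enough to try it.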