I’m struggling with parsing some flaky HTML tables down to lists with Beautiful Soup.

Asked: June 9, 2026 (2026-06-09T21:18:33+00:00)

I’m struggling with parsing some flaky HTML tables down to lists with Beautiful Soup. The tables in question never close their cells: the </td> tags are missing.

Using the following code (not the real tables I’m parsing, but functionally similar):

import bs4

# Sample markup: every <td> is left unclosed
test = "<table> <tr><td>1<td>2<td>3</tr> <tr><td>1<td>2<td>3</tr> </table>"

def walk_table2(text):
    """Take an HTML table and return a list of lists (the cells in each row)."""
    # No parser is named here, so Beautiful Soup picks whichever one is installed
    soup = bs4.BeautifulSoup(text)
    return [[x for x in row.findAll('td')] for row in soup.findAll('tr')]

print(walk_table2(test))

Gives me:

[[<td>1<td>2<td>3</td></td></td>, <td>2<td>3</td></td>, <td>3</td>], [<td>1<td>2<td>3</td></td></td>, <td>2<td>3</td></td>, <td>3</td>]]

Rather than the expected:

[[<td>1</td>, <td>2</td>, <td>3</td>], [<td>1</td>, <td>2</td>, <td>3</td>]]

It seems that the lxml parser that Beautiful Soup is using decides to add the </td> tag before the next instance of </tr> rather than the next instance of <td>.

At this point, I’m wondering if there is a good option to make the parser place the closing </td> tags in the correct location, or if it would be easier to use a regular expression to insert them manually before passing the string to BeautifulSoup… Any thoughts? Thanks in advance!

1 Answer

Editorial Team · Answer 1 · 2026-06-09T21:18:34+00:00

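You’re seeing decisions made by Python’s built-in HTML parser (html.parser), which Beautiful Soup falls back to when neither lxml nor html5lib is installed. If you don’t like the way that parser does things, you can tell Beautiful Soup to use a different one by passing its name as the second argument. The html5lib parser and the lxml parser both give the result you want: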

>>> soup = bs4.BeautifulSoup(test, "lxml")
>>> [[x for x in row.findAll('td')] for row in soup.findAll('tr')]
[[<td>1</td>, <td>2</td>, <td>3</td>], [<td>1</td>, <td>2</td>, <td>3</td>]]

>>> soup = bs4.BeautifulSoup(test, "html5lib")
>>> [[x for x in row.findAll('td')] for row in soup.findAll('tr')]
[[<td>1</td>, <td>2</td>, <td>3</td>], [<td>1</td>, <td>2</td>, <td>3</td>]]
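
Since the goal is plain lists rather than tag objects, here is a minimal sketch of the same idea wrapped in a function. It assumes lxml is installed; the name table_to_lists and its parser argument are made up for illustration and are not part of Beautiful Soup.

```python
import bs4

# Same sample markup as in the question: the <td> cells are never closed.
test = "<table> <tr><td>1<td>2<td>3</tr> <tr><td>1<td>2<td>3</tr> </table>"


def table_to_lists(text, parser="lxml"):
    """Return one list of cell strings per <tr> (hypothetical helper).

    The parser is named explicitly so the result does not depend on
    which parsers happen to be installed on the machine.
    """
    soup = bs4.BeautifulSoup(text, parser)
    return [
        [cell.get_text(strip=True) for cell in row.find_all("td")]
        for row in soup.find_all("tr")
    ]


print(table_to_lists(test))
```

Passing "html5lib" as the second argument should work the same way if that package is installed instead. Naming the parser also avoids the warning Beautiful Soup emits when it has to guess.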
