I’ve got an HTML table that I’m trying to parse the information from. However,

Asked: May 16, 2026 (2026-05-16T20:43:37+00:00)

I’ve got an HTML table that I’m trying to parse the information from. However, some of the tables span multiple rows/columns, so what I would like to do is use something like BeautifulSoup to parse the table into some type of Python structure. I’m thinking of just using a list of lists so I would turn something like

<tr>
  <td>1,1</td>
  <td>1,2</td>
</tr>
<tr>
  <td>2,1</td>
  <td>2,2</td>
</tr>

into

[['1,1', '1,2'],
 ['2,1', '2,2']]

Which I (think) should be fairly straightforward. However, there are some slight complications because some of the cells span multiple rows/cols. Plus there’s a lot of completely unnecessary information:

    <td ondblclick="DoAdd('/student_center/sc_all_rooms/d05/09/2010/editformnew?display=W&amp;style=L&amp;positioning=A&amp;adddirect=yes&amp;accessid=CreateNewEdit&amp;filterblock=N&amp;popeditform=yes&amp;returncalendar=student_center/sc_all_rooms')"
     class="listdefaultmonthbg" 
     style="cursor:crosshair;" 
     width="5%" 
     nowrap="1" 
     rowspan="1">
       <a class="listdatelink" 
          href="/student_center/sc_all_rooms/d05/09/2010/edit?style=L&amp;display=W&amp;positioning=A&amp;filterblock=N&amp;adddirect=yes&amp;accessid=CreateNewEdit">Sep 5</a>
    </td>

And what the code really looks like is even worse. All I really need out of there is:

<td rowspan="1">Sep 5</td>

Two rows later, there is a <td> with a rowspan of 17. For multi-row spans I was thinking something like this:

<tr>
  <td rowspan="2">Sep 5</td>
  <td>Some event</td>
</tr>
<tr>
  <td>Some other event</td>
</tr>

would end up like this:

[["Sep 5", "Some event"],
 [None, "Some other event"]]
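
One way to get that layout is to track which grid positions are already covered by a span from an earlier cell. The sketch below is only an illustration: the function name `expand_spans` and the `(text, rowspan, colspan)` tuple format are assumptions made for this example, not part of BeautifulSoup or of the page being parsed. Covered positions are filled with `None`, as in the desired output above.

```python
def expand_spans(rows):
    """Expand rows of (text, rowspan, colspan) tuples into a rectangular grid.

    The top-left position of each cell holds its text; every other
    position covered by the span holds None.
    """
    occupied = {}  # (row, col) -> text or None
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip columns already claimed by a rowspan from a previous row.
            while (r, c) in occupied:
                c += 1
            # Mark every position the cell covers, then store its text once.
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied[(r + dr, c + dc)] = None
            occupied[(r, c)] = text
            c += colspan
    if not occupied:
        return []
    n_cols = max(col for _, col in occupied) + 1
    # Rowspans that run past the last <tr> are clipped to the real row count.
    return [[occupied.get((r, c)) for c in range(n_cols)]
            for r in range(len(rows))]


cells = [
    [("Sep 5", 2, 1), ("Some event", 1, 1)],
    [("Some other event", 1, 1)],
]
grid = expand_spans(cells)
print(grid)
```

A rowspan of 17 works the same way: the first column of the next 16 rows is pre-filled with `None`, so the cells in those rows shift right automatically.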

There are multiple tables on the page, and I can already find the one I want; I’m just not sure how to parse out the information I need. I know I can use BeautifulSoup to “RenderContents”, but in some cases there are link tags that I need to get rid of (while keeping the text).

I was thinking of a process something like this:

Find table
Count rows in tables (len(table.findAll('tr'))?)
Create list
Parse table into list (BeautifulSoup syntax???)
???
Profit! (Well, it’s a purely internal program, so not really… )
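
The steps above can be sketched with BeautifulSoup 4 (the `bs4` package, where `findAll` is spelled `find_all`). This is a minimal sketch under a few assumptions: the sample markup and its `href` are made up for the example, the table contains no nested tables, and `extract_cells` is a hypothetical helper name. `get_text` drops the link tags while keeping their text, and the `rowspan`/`colspan` attributes are read with a default of 1.

```python
from bs4 import BeautifulSoup

# Made-up sample markup, trimmed down from the kind of cell shown in the question.
html = """
<table>
  <tr>
    <td rowspan="2" class="listdefaultmonthbg" nowrap="1">
      <a class="listdatelink" href="/edit">Sep 5</a>
    </td>
    <td>Some event</td>
  </tr>
  <tr>
    <td>Some other event</td>
  </tr>
</table>
"""


def extract_cells(table):
    """Return one list per <tr>, holding (text, rowspan, colspan) per cell."""
    rows = []
    for tr in table.find_all("tr"):
        row = []
        # recursive=False keeps only the direct <td>/<th> children of this row.
        for cell in tr.find_all(["td", "th"], recursive=False):
            text = cell.get_text(" ", strip=True)  # text only, tags removed
            rowspan = int(cell.get("rowspan", 1))
            colspan = int(cell.get("colspan", 1))
            row.append((text, rowspan, colspan))
        rows.append(row)
    return rows


soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")  # on the real page, select the table you already located
cells = extract_cells(table)
print(len(cells), "rows")
print(cells)
```

Keeping the span values next to the text means that turning the result into a rectangular list of lists, with `None` under each spanned cell, can be done as a separate pass over `cells`.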

1 Answer

Editorial Team · Answer 1 · 2026-05-16T20:43:37+00:00

There was a recent discussion in the Python group on LinkedIn about a similar issue, and apparently lxml is the most recommended Python parser for HTML pages.

http://www.linkedin.com/groupItem?view=&gid=25827&type=member&item=27735259&qid=d2948a0e-6c0c-4256-851b-5e7007859553&goback=.gmp_25827
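
As a rough illustration of the lxml route (not taken from the linked discussion), the sketch below reads the simple two-by-two table from the question into a list of lists. The helper name `table_text_rows` is an assumption for this example, and spans are ignored here; `text_content()` returns a cell's text with any nested tags such as links removed.

```python
import lxml.html

html = """
<table>
  <tr><td>1,1</td><td>1,2</td></tr>
  <tr><td>2,1</td><td>2,2</td></tr>
</table>
"""


def table_text_rows(table):
    """Return the stripped text of every direct cell, one list per <tr>."""
    rows = []
    for tr in table.iter("tr"):
        # The XPath union keeps document order for <td> and <th> children.
        rows.append([cell.text_content().strip()
                     for cell in tr.xpath("./td | ./th")])
    return rows


table = lxml.html.fromstring(html)  # a single-element fragment parses to that element
rows = table_text_rows(table)
print(rows)
```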
