I am trying to parse a simple HTML table using BeautifulSoup, but I am running into some problems.
Here is my input:
<table id="people" class="tt" width="99%" border="0" cellpadding="0" cellspacing="1">
<tr>
<td colspan="3" bgcolor="#d3d3d3">
<p align="center" style="border: 1px solid #c0c0c0; padding: 0.02in">
<a name="faculty">
</a>
<b>
Faculty
</b>
</p>
</td>
</tr>
<tr>
<td>
<p align="center">
<font color="#000080">
<a href="http://www.website.com/%7Empop">
<font color="#000080">
<img src="images/mpop.jpg" name="graphics1" align="bottom" width="70" height="85" border="1" />
</font>
</a>
</font>
</p>
</td>
<td>
<p>
<b>
John Doe, Ph.D.
</b>
<br />
Associate Professor, Computer
Science
<br />
</p>
</td>
<td>
<p>
Office: Sciences Bldg.
<br />
Phone:
xxx-xxx-xxxx
<br />
jd [at] website.com
<br />
</p>
</td>
</tr>
<tr>
<td>
<p align="center">
<font color="#000080">
<a href="http://www.website.com/%7Ercolwell">
<font color="#000080">
<img src="images/rcolwell.jpg" name="graphics2" align="bottom" width="70" height="97" border="1" />
</font>
</a>
</font>
</p>
</td>
<td>
<p>
<b>
Jane Doe, Ph.D.
</b>
<br />
Professor
<br />
School of Public Health
<br />
</p>
</td>
<td>
<p>
Sciences Bldg
<br />
jd [at]
website.com
<br />
</a>
</p>
</td>
</tr>
</table>
Here is my code:
t = soup.findAll("table", id="people")
for table in t:
    rows = table.findAll("tr")
    for tr in rows:
        cols = tr.findAll("td")
        for td in cols:
            print(str(td.find(text=True)))  # tried also print(td.find(text=True))
            print(",")
        print("\n")
This generates output containing only commas, without the actual text. When I use print(td) instead, I do see the information I need, but as HTML with all the tags. Can anyone point me to the right approach here? I want to extract only the cell content.
Cheers
Maybe you are looking for something like this:
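A minimal sketch of one way to get only the cell text, assuming BeautifulSoup 4 (the bs4 package) with its built-in html.parser backend. The trimmed sample HTML and the table_rows helper name are illustrative and not from the question. The loop in the question likely prints only commas because td.find(text=True) returns just the first text node, which in this markup is the newline right after each <td>.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the table in the question (illustrative sample).
HTML = """
<table id="people">
<tr>
<td>
<p>
<b>
John Doe, Ph.D.
</b>
<br />
Associate Professor, Computer
Science
<br />
</p>
</td>
<td>
<p>
Office: Sciences Bldg.
<br />
Phone:
xxx-xxx-xxxx
<br />
</p>
</td>
</tr>
</table>
"""


def table_rows(html, table_id="people"):
    """Return one list of cleaned cell strings per <tr> (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.find_all("table", id=table_id):
        for tr in table.find_all("tr"):
            cells = []
            for td in tr.find_all("td"):
                # get_text() concatenates every text node under the cell;
                # split/join then collapses newlines and repeated spaces.
                cells.append(" ".join(td.get_text().split()))
            rows.append(cells)
    return rows


for row in table_rows(HTML):
    print(",".join(row))
```

Collecting the cells into lists first, rather than printing inside the loop, keeps the extraction separate from the output format, so the same rows could be handed to the csv module instead.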
Alternatively you can use u''.join(map(unicode, td.contents)), depending on what exactly you want to get printed. (That form is for Python 2; on Python 3 the equivalent is ''.join(map(str, td.contents)).)
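To illustrate the difference, here is a small Python 3 sketch, again assuming BeautifulSoup 4 with html.parser; the one-cell table is a made-up sample. Joining td.contents keeps the inner markup, while get_text() drops the tags.

```python
from bs4 import BeautifulSoup

# Made-up single-cell table, just to compare the two approaches.
soup = BeautifulSoup(
    "<table><tr><td><b>Jane Doe</b><br/>Professor</td></tr></table>",
    "html.parser",
)
td = soup.find("td")

# Joining the children as strings preserves the inner HTML, tags included.
inner_html = "".join(map(str, td.contents))

# get_text() keeps only the text nodes; here they are stripped and
# joined with a single space.
text_only = td.get_text(" ", strip=True)

print(inner_html)
print(text_only)
```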