Sample code:
from BeautifulSoup import BeautifulSoup, SoupStrainer
html='''<tr>
<td align="left">Foo<br />
Bar<br /></td>
</tr>'''
soup=BeautifulSoup(html)
rows=soup.findAll('tr')
print rows
print rows[0].text.encode("utf8")
I would like the output to be something like “Foo Bar”; an actual newline between the two lines would be fine too. But the output I get is just “FooBar”, with no whitespace between the two lines.
I'm very new to Python and BeautifulSoup; can someone give me a hand?
You can go one level deeper using cell = rows[0].find('td'), then inspect its children with cell.contents, keep the elements you need, and join them with spaces.
Another option: use a regular expression to replace each <br /> in the markup with a space. Then you can replace runs of consecutive whitespace with a single space.
The string then no longer contains <br /> tags or repeated whitespace, so you can easily extract the part you need.
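As a rough sketch of the regular-expression option (not necessarily the exact code originally posted here), something like the following should work. It is written for Python 3 and uses only the standard library so it runs on its own. The TextCollector class is a hypothetical helper standing in for the parser; in the question's setup you would presumably pass no_breaks to BeautifulSoup(...) instead and read the row's text as before.

```python
import re
from html.parser import HTMLParser

html = '''<tr>
<td align="left">Foo<br />
Bar<br /></td>
</tr>'''

# Step 1: replace every <br>, <br/> or <br /> with a single space.
no_breaks = re.sub(r'<br\s*/?>', ' ', html)


class TextCollector(HTMLParser):
    """Hypothetical helper (not from the thread): gathers all text nodes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.chunks.append(data)


collector = TextCollector()
collector.feed(no_breaks)
collector.close()  # flush any text still buffered by the parser

# Step 2: collapse runs of whitespace (newlines included) into one space.
text = re.sub(r'\s+', ' ', ''.join(collector.chunks)).strip()
print(text)
```

The substitution happens on the raw markup before any parsing, so the space is already in place by the time the text is extracted.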