I have the following Python code:
import urllib2
from BeautifulSoup import BeautifulSoup

def scrapeSite(urlToCheck):
    # Download the page and parse it
    html = urllib2.urlopen(urlToCheck).read()
    soup = BeautifulSoup(html)
    # Find every <td class="c"> cell and print its markup
    tdtags = soup.findAll('td', { "class" : "c" })
    for t in tdtags:
        print t.encode('latin1')
This prints the following HTML:
<td class="c">
<a href="more.asp">FOO</a>
</td>
<td class="c">
<a href="alotmore.asp">BAR</a>
</td>
I’d like to get the text inside the a node (e.g. FOO or BAR), which I expected to be t.contents.contents. Unfortunately it isn’t that easy 🙂
Does anyone have an idea how to solve that?
Thanks a lot, any help is appreciated!
Cheers,
Joseph
In this case, you can use
t.contents[1].contents[0]
to get FOO and BAR.
The thing is that contents returns a list of all child elements (Tags and NavigableStrings). If you print contents, you’ll see something like
[u'\n', <a href="more.asp">FOO</a>, u'\n']
So, to get to the actual tag you need to access
contents[1]
(assuming exactly this markup; the index can vary depending on the source HTML). Once you’ve found the right index, you can use
contents[0]
on that tag to get the string inside the a tag.
Now, as this depends on the exact contents of the HTML source, it’s very fragile. A more generic and robust solution is to call find() again to locate the ‘a’ tag, via
t.find('a')
and then use its contents list to get the values inside it:
t.find('a').contents[0]
or just
t.find('a').contents
to get the whole list.
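
For reference, here is a minimal sketch of the find()-based approach applied to the HTML from the question. It assumes the newer bs4 package (the successor to the BeautifulSoup 3 module used in the question) and Python 3. The helper name link_texts and the HTML constant are made up for illustration, and the page is inlined as a string instead of being fetched with urllib2.

```python
# Assumption: bs4 (Beautiful Soup 4) is installed; the question used BeautifulSoup 3.
from bs4 import BeautifulSoup

# The HTML from the question, inlined so the example needs no network access.
HTML = """
<td class="c">
<a href="more.asp">FOO</a>
</td>
<td class="c">
<a href="alotmore.asp">BAR</a>
</td>
"""


def link_texts(html):
    """Return the text of the first <a> inside every <td class="c">."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for td in soup.find_all("td", {"class": "c"}):
        link = td.find("a")  # look the tag up by name, not by list index
        if link is None:     # skip cells that contain no link
            continue
        texts.append(link.get_text(strip=True))
    return texts


if __name__ == "__main__":
    print(link_texts(HTML))
```

Because the a tag is looked up by name, the result does not depend on how many whitespace strings surround it in contents, which is what makes the contents[1] approach fragile.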