I’m getting my BeautifulSoup and Python bearings by walking through the process of scraping a friend’s (structured, if clunky) website, with the long-term goal of migrating the whole thing into a content management system.
If I pull out exactly one cell in the console (where soup = BeautifulSoup(urllib2.urlopen("http://www.bicyclepaintings.com/archive/index.html"))) with:
cell = soup.find_all('td',{'valign':'bottom'})[3]
I can play around with pulling out substrings. These all work fine: cell.br.next_sibling, cell.find('b').text. But when I try to loop through all the cells with a for loop:
def parse_archive(url):
    soup = get_soup(url)
    paintings = []
    for cell in soup.find_all('td',{'valign':'bottom'}):
        painting_title = cell.find('b').text
        painting_media = cell.br.next_sibling
        record = painting_title, painting_media
        paintings.append(record)
    return paintings
I get an attribute error (AttributeError: 'NoneType' object has no attribute 'text'). I can get some of the same information by looping back through:
for item in cell.find_all('b'):
    painting_title = item.text
But I don’t see a way to get at the sibling to <br/> and (more to the point) I don’t understand why it works if I pull one item out but not if I try to access them through a for loop. What am I missing here?
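Your issue is that the site you are trying to scrape has a bunch of <td> tags at the end that do not contain a <b> tag. For those cells cell.find('b') returns None, so reading .text raises the AttributeError. The single cell you pulled out in the console happened to have a <b> tag, which is why it worked there. You just need to modify your code to skip these cells, for example by checking whether cell.find('b') returned None before reading .text.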
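A minimal sketch of that check, written for Python 3 with bs4 rather than the urllib2 setup in the question. SAMPLE_HTML and parse_titles are hypothetical names made up for illustration, and the markup is a stand-in, not the real archive page:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the archive page: two painting cells followed by
# a trailing cell that has no <b> tag (an assumption about the real layout).
SAMPLE_HTML = """
<table><tr>
<td valign="bottom"><b>Red Bicycle</b><br/>oil on canvas</td>
<td valign="bottom"><b>Blue Bicycle</b><br/>acrylic on panel</td>
<td valign="bottom"><a href="index2.html">next page</a></td>
</tr></table>
"""

def parse_titles(html):
    """Collect the <b> text of every matching cell, skipping cells without one."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for cell in soup.find_all("td", {"valign": "bottom"}):
        bold = cell.find("b")
        if bold is None:
            # find() returns None when nothing matches, so skip this cell
            # instead of calling .text on None.
            continue
        titles.append(bold.text)
    return titles
```

Calling parse_titles(SAMPLE_HTML) returns only the titles of the two painting cells; the trailing cell is ignored rather than raising an error.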
As far as matching the painting_media, you can just use:
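One possible approach, sketched under the assumption that the media description is the text node directly after the first <br/> in each cell (the same thing cell.br.next_sibling returned in the question). The helper name get_media and the sample markup are hypothetical:

```python
from bs4 import BeautifulSoup, NavigableString

def get_media(cell):
    """Return the text right after the cell's first <br/>, or None if absent."""
    br = cell.find("br")
    if br is None:
        # Cells without a <br/> have nothing to read; cell.br would be None too.
        return None
    sibling = br.next_sibling
    if isinstance(sibling, NavigableString):
        # Only plain text nodes count as a media description in this sketch.
        return sibling.strip() or None
    return None

# Hypothetical markup standing in for two cells of the archive table.
soup = BeautifulSoup(
    '<table><tr>'
    '<td valign="bottom"><b>Red Bicycle</b><br/>oil on canvas</td>'
    '<td valign="bottom"><a href="index2.html">next page</a></td>'
    '</tr></table>',
    "html.parser",
)
media = [get_media(cell) for cell in soup.find_all("td", {"valign": "bottom"})]
```

Here the first cell yields its media string and the second yields None, so the same None check used for the title can filter out cells that are not paintings.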