I’m getting my BeautifulSoup and python bearings by walking through the process of scraping

Asked: June 13, 2026 (2026-06-13T14:33:11+00:00)

I’m getting my BeautifulSoup and Python bearings by walking through the process of scraping a friend’s (structured, if clunky) website, with the long-term goal of migrating the whole thing into a content management system.

If I pull out exactly one cell in the console (with soup = BeautifulSoup(urllib2.urlopen("http://www.bicyclepaintings.com/archive/index.html"))) using:

cell = soup.find_all('td',{'valign':'bottom'})[3]

I can play around with pulling out substrings. These all work fine: cell.br.next_sibling, cell.find('b').text. But when I try to loop through all the cells with a for loop:

def parse_archive(url):
    soup = get_soup(url)
    paintings = []
    for cell in soup.find_all('td',{'valign':'bottom'}):
        painting_title = cell.find('b').text
        painting_media = cell.br.next_sibling 
        record = painting_title, painting_media
        paintings.append(record)
    return paintings

I get an attribute error (AttributeError: 'NoneType' object has no attribute 'text'). I can get some of the same information by looping back through:

    for item in cell.find_all('b'):
        painting_title = item.text

But I don’t see a way to get at the sibling to <br/> and (more to the point) I don’t understand why it works if I pull one item out but not if I try to access them through a for loop. What am I missing here?

1 Answer

Editorial Team · Answer 1 · 2026-06-13T14:33:12+00:00

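The problem is that the page you are scraping ends with a run of placeholder <td> cells that contain no <b> tag. For those cells, cell.find('b') returns None, so accessing .text raises the AttributeError. The single cell you pulled out by index happened to be one that does contain a <b> tag: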

<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data, use &quot; for quotes --></font></p></td>
<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data, use &quot; for quotes --></font></p></td>
<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data, use &quot; for quotes --></font></p></td>
<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data, use &quot; for quotes --></font></p></td>
<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data, use &quot; for quotes --></font></p></td>
<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data, use &quot; for quotes --></font></p></td>

Modify your loop to skip these cells:

for cell in soup.find_all('td',{'valign':'bottom'}):
    title = cell.find('b')
    if title is None:
        continue
    painting_title = title.text
    painting_media = cell.br.next_sibling 
    record = painting_title, painting_media
    paintings.append(record)
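
For reference, here is a minimal self-contained sketch of the same skip-the-empty-cells loop that can be run without fetching the site. SAMPLE_HTML and parse_archive_html are hypothetical names made up for this illustration, and the two painting titles and media strings are invented sample data, not taken from the real page. It also assumes Python 3 with the bs4 package and the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the archive page: two filled cells followed by
# one empty placeholder cell like the ones at the end of the real page.
SAMPLE_HTML = """
<table><tr>
<td nowrap valign="bottom"><p><font><b>Red Bike</b><br/>oil on canvas</font></p></td>
<td nowrap valign="bottom"><p><font><b>Blue Bike</b><br/>acrylic on board</font></p></td>
<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data, use &quot; for quotes --></font></p></td>
</tr></table>
"""


def parse_archive_html(html):
    """Return (title, media) tuples, skipping cells that have no <b> tag."""
    soup = BeautifulSoup(html, "html.parser")
    paintings = []
    for cell in soup.find_all("td", {"valign": "bottom"}):
        title = cell.find("b")
        if title is None:
            # Placeholder cell: nothing to extract, move on to the next one.
            continue
        # The text node right after <br/> holds the media description.
        media = str(cell.br.next_sibling).strip()
        paintings.append((title.text, media))
    return paintings


if __name__ == "__main__":
    for record in parse_archive_html(SAMPLE_HTML):
        print(record)
```

The placeholder cell is silently dropped, so only the two filled cells end up in the result.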

To extract painting_media, you can use:

painting_media = list(cell.br.children)
painting_media = painting_media[0].strip() if painting_media else ''
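
If you would rather stay with the br.next_sibling approach from the question, a defensive variant might look like the sketch below. This is an alternative to the snippet above, not a restatement of it. media_after_br is a hypothetical helper name, and the sketch assumes the media description is a plain text node directly after the first <br> in the cell, which may not hold for every cell on the real page.

```python
from bs4 import BeautifulSoup, Comment, NavigableString


def media_after_br(cell):
    """Return the stripped text directly after the cell's first <br>, or ''.

    Hypothetical helper: returns '' when there is no <br>, or when the
    node after it is a tag, an HTML comment, or missing altogether.
    """
    br = cell.find("br")
    if br is None:
        return ""
    sibling = br.next_sibling
    # Comment is a subclass of NavigableString, so exclude it explicitly.
    if isinstance(sibling, NavigableString) and not isinstance(sibling, Comment):
        return str(sibling).strip()
    return ""
```

Inside the loop this would be used as painting_media = media_after_br(cell), which avoids a second AttributeError if some cell has a <b> tag but no <br>.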
