I’m new to the whole concept of screen scraping in Python, although I’ve done

Question

Asked: 2026-05-23T13:16:15+00:00

I’m new to screen scraping in Python, although I’ve done a bit of screen scraping in R. I’m trying to scrape the Yelp website, specifically the names of each insurance agency that the Yelp search returns. With most scraping tasks I can get as far as the following, but I always have a hard time going forward with parsing the XML.

import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch the search results page and parse the returned markup
soup = BeautifulSoup(urllib2.urlopen('http://www.yelp.com/search?find_desc=insurance+agency&ns=1&find_loc=Austin').read())

# Dump the whole parsed document
print soup

So when scraping a site, what are the steps that one should follow? Is there a set of necessary actions that one needs to take each time they attempt to scrape a site?

I’m running Python 2.6 on Ubuntu 10.10.

I realize that this may be a poor SO question as outlined in the FAQ, but I’m hoping someone can provide some general guidelines and things to consider when scraping a site.

1 Answer

Editorial Team · Answer 1 · 2026-05-23T13:16:16+00:00

I’d recommend reading up on XPath and trying this Scrapy tutorial: http://doc.scrapy.org/intro/tutorial.html. It is fairly easy to write a spider like this:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz.org"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    # parse() must be a method of the spider; it is called with each downloaded page
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Select every list item, then pull out its link text, URL and description
        sites = hxs.select('//ul/li')
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            print title, link, desc
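
To get a feel for the XPath expressions used in the spider above without installing anything, here is a minimal sketch using the standard library's xml.etree.ElementTree, which supports only a small subset of XPath. The SAMPLE markup, the example.com URLs and the extract_links helper are made up for illustration. It also assumes well-formed markup, which real pages rarely are, so treat it as a way to experiment with path expressions rather than as a replacement for Scrapy.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup standing in for a results page.
SAMPLE = """
<html>
  <body>
    <ul>
      <li><a href="http://example.com/a">Agency A</a> - first description</li>
      <li><a href="http://example.com/b">Agency B</a> - second description</li>
    </ul>
  </body>
</html>
"""


def extract_links(markup):
    """Return a (title, link, desc) tuple for every <li> directly inside a <ul>."""
    root = ET.fromstring(markup.strip())
    rows = []
    # './/ul/li' plays the role of '//ul/li' in the spider above.
    for item in root.findall(".//ul/li"):
        anchor = item.find("a")
        if anchor is None:
            continue  # skip list items that contain no link
        title = anchor.text                   # like 'a/text()'
        link = anchor.get("href")             # like 'a/@href'
        desc = (anchor.tail or "").strip()    # text that follows the link
        rows.append((title, link, desc))
    return rows


for row in extract_links(SAMPLE):
    print(row)
```

The same three pieces (a path that selects the repeating element, then relative lookups for the text, the attribute and the trailing text) are what the select() calls in the spider do against the real page.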
