I am data mining a website using Beautiful Soup . The first page is

Question

0

Asked: May 30, 20262026-05-30T03:45:36+00:00 2026-05-30T03:45:36+00:00

I am data mining a website using Beautiful Soup . The first page is

0

I am data mining a website using Beautiful Soup. The first page is Scoutmob’s map, where I grab each city, open up the page, and grab the URL of each deal in that city.

Currently I’m not using threads and everything is being processed serially. For about all 500 deals (from all cities), my program currently takes about 400 seconds.

For practice, I wanted to modify my code to use threading. I have read up some tutorials and examples on how to create queues in Python, but I don’t want to create 500 threads to process 500 URLs.

Instead I want to create about 20 (worker) threads to process all the URLs. Can someone show me an example how 20 threads can process 500 URL in a queue?

I want each worker to grab an unprocessed URL from the queue, and data mine, then once finished, work on another unprocessed URL. Each worker only exit when there is no more URLs in the queue.

By the way, while each worker is data mining, it also writes the content to a database. So there might be issues with threading in the database, but that is another question for another day :-).

Thanks in advance!

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-30T03:45:38+00:00

For your example creating worker queues is probably overkill. You might have better luck if you grab the rss feed published for each of the pages rather than trying to parse the HTML which is slower. I slapped together the quick little script below that parses it in a total of ~13 seconds… ~8 seconds to grab the cities and ~5 seconds to parse all the rss feeds.

In today’s run it grabs 310 total deals from 13 cities (there are a total of 20 cities listed, but 7 of them are listed as “coming soon”).

#!/usr/bin/env python

from lxml import etree, html
from urlparse import urljoin
import time

t = time.time()
base = 'http://scoutmob.com/'
main = html.parse(base)
cities = [x.split('?')[0] for x in main.xpath("//a[starts-with(@class, 'cities-')]/@href")]
urls = [urljoin(base, x + '/today') for x in cities]
docs = [html.parse(url) for url in urls]
feeds = [doc.xpath("//link[@rel='alternate']/@href")[0] for doc in docs]
# filter out the "coming soon" feeds
feeds = [x for x in feeds if x != 'http://feeds.feedburner.com/scoutmob']
print time.time() - t
print len(cities), cities
print len(feeds), feeds

t = time.time()
items = [etree.parse(x).xpath("//item") for x in feeds]
print time.time() - t
count = sum(map(len, items))
print count

Yields this output:

7.79690480232
20 ['/atlanta', '/new-york', '/san-francisco', '/washington-dc', '/charlotte', '/miami', '/philadelphia', '/houston', '/minneapolis', '/phoenix', '/san-diego', '/nashville', '/austin', '/boston', '/chicago', '/dallas', '/denver', '/los-angeles', '/seattle', '/portland']
13 ['http://feeds.feedburner.com/scoutmob/atl', 'http://feeds.feedburner.com/scoutmob/nyc', 'http://feeds.feedburner.com/scoutmob/sf', 'http://scoutmob.com/washington-dc.rss', 'http://scoutmob.com/nashville.rss', 'http://scoutmob.com/austin.rss', 'http://scoutmob.com/boston.rss', 'http://scoutmob.com/chicago.rss', 'http://scoutmob.com/dallas.rss', 'http://scoutmob.com/denver.rss', 'http://scoutmob.com/los-angeles.rss', 'http://scoutmob.com/seattle.rss', 'http://scoutmob.com/portland.rss']
4.76977992058
310

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I am data mining a website using Beautiful Soup . The first page is

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply