
The Archive Base

Editorial Team
Asked: May 30, 2026

I am data mining a website using Beautiful Soup. The first page is

I am data mining a website using Beautiful Soup. The first page is Scoutmob’s map, where I grab each city, open up the page, and grab the URL of each deal in that city.

Currently I’m not using threads, and everything is processed serially. For roughly 500 deals (across all cities), my program takes about 400 seconds.

For practice, I want to modify my code to use threading. I have read some tutorials and examples on how to create queues in Python, but I don’t want to create 500 threads to process 500 URLs.

Instead, I want to create about 20 worker threads to process all the URLs. Can someone show me an example of how 20 threads can process 500 URLs in a queue?

I want each worker to grab an unprocessed URL from the queue, data mine it, and, once finished, move on to another unprocessed URL. Each worker should exit only when there are no more URLs in the queue.

By the way, while each worker is data mining, it also writes the content to a database. So there might be issues with threading in the database, but that is another question for another day :-).

Thanks in advance!


1 Answer

  1. Editorial Team
    Added an answer on May 30, 2026 at 3:45 am

    For your example, creating worker queues is probably overkill. You might have better luck if you grab the RSS feed published for each of the pages rather than parsing the HTML, which is slower. I slapped together the quick little script below, which runs in a total of ~13 seconds: ~8 seconds to grab the cities and ~5 seconds to parse all the RSS feeds.

    In today’s run it grabs 310 total deals from 13 cities (there are a total of 20 cities listed, but 7 of them are listed as “coming soon”).

    #!/usr/bin/env python
    from __future__ import print_function

    import time

    from lxml import etree, html

    try:
        from urllib.parse import urljoin  # Python 3
    except ImportError:
        from urlparse import urljoin  # Python 2

    # Step 1: collect the city pages and the RSS feed each one advertises.
    t = time.time()
    base = 'http://scoutmob.com/'
    main = html.parse(base)
    # City links have a class starting with "cities-"; drop any query string.
    cities = [x.split('?')[0] for x in main.xpath("//a[starts-with(@class, 'cities-')]/@href")]
    urls = [urljoin(base, x + '/today') for x in cities]
    docs = [html.parse(url) for url in urls]
    feeds = [doc.xpath("//link[@rel='alternate']/@href")[0] for doc in docs]
    # filter out the "coming soon" feeds
    feeds = [x for x in feeds if x != 'http://feeds.feedburner.com/scoutmob']
    print(time.time() - t)
    print(len(cities), cities)
    print(len(feeds), feeds)

    # Step 2: parse every feed and count its <item> elements.
    t = time.time()
    items = [etree.parse(x).xpath("//item") for x in feeds]
    print(time.time() - t)
    count = sum(map(len, items))
    print(count)
    

    Yields this output:

    7.79690480232
    20 ['/atlanta', '/new-york', '/san-francisco', '/washington-dc', '/charlotte', '/miami', '/philadelphia', '/houston', '/minneapolis', '/phoenix', '/san-diego', '/nashville', '/austin', '/boston', '/chicago', '/dallas', '/denver', '/los-angeles', '/seattle', '/portland']
    13 ['http://feeds.feedburner.com/scoutmob/atl', 'http://feeds.feedburner.com/scoutmob/nyc', 'http://feeds.feedburner.com/scoutmob/sf', 'http://scoutmob.com/washington-dc.rss', 'http://scoutmob.com/nashville.rss', 'http://scoutmob.com/austin.rss', 'http://scoutmob.com/boston.rss', 'http://scoutmob.com/chicago.rss', 'http://scoutmob.com/dallas.rss', 'http://scoutmob.com/denver.rss', 'http://scoutmob.com/los-angeles.rss', 'http://scoutmob.com/seattle.rss', 'http://scoutmob.com/portland.rss']
    4.76977992058
    310
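If you still want to practice the worker pattern from the question, here is a minimal sketch of 20 threads draining a queue of 500 URLs. It is written for Python 3 (the module is named `Queue` in Python 2). The `process_url` function and the `example.com` URLs are hypothetical placeholders for your real fetch-and-parse step, and the lock-protected dictionary only stands in for wherever you store results.

```python
import queue
import threading

NUM_WORKERS = 20  # pool size taken from the question


def process_url(url):
    """Hypothetical stand-in for the real fetch-and-parse step."""
    return len(url)


def worker(tasks, results, lock):
    """Pull URLs until the queue is empty, then exit."""
    while True:
        try:
            url = tasks.get_nowait()
        except queue.Empty:
            return  # no unprocessed URLs left, so this worker stops
        outcome = process_url(url)
        with lock:  # guard the shared dict while writing
            results[url] = outcome


def run_pool(urls, num_workers=NUM_WORKERS):
    """Process every URL with a fixed number of worker threads."""
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)  # fill the queue before any worker starts
    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(tasks, results, lock))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every worker has exited
    return results


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only
    urls = ["http://example.com/deal/%d" % i for i in range(500)]
    print(len(run_pool(urls)))
```

Exiting on `queue.Empty` is only safe here because the queue is completely filled before the workers start. If URLs were added while workers run, you would more likely want sentinel values or `task_done()` with `Queue.join()`.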
    
© 2021 The Archive Base. All Rights Reserved