I am data mining a website using Beautiful Soup. The first page is Scoutmob’s map, where I grab each city, open up the page, and grab the URL of each deal in that city.
Currently I’m not using threads and everything is being processed serially. For about all 500 deals (from all cities), my program currently takes about 400 seconds.
For practice, I wanted to modify my code to use threading. I have read up some tutorials and examples on how to create queues in Python, but I don’t want to create 500 threads to process 500 URLs.
Instead I want to create about 20 (worker) threads to process all the URLs. Can someone show me an example how 20 threads can process 500 URL in a queue?
I want each worker to grab an unprocessed URL from the queue, and data mine, then once finished, work on another unprocessed URL. Each worker only exit when there is no more URLs in the queue.
By the way, while each worker is data mining, it also writes the content to a database. So there might be issues with threading in the database, but that is another question for another day :-).
Thanks in advance!
For your example creating worker queues is probably overkill. You might have better luck if you grab the rss feed published for each of the pages rather than trying to parse the HTML which is slower. I slapped together the quick little script below that parses it in a total of ~13 seconds… ~8 seconds to grab the cities and ~5 seconds to parse all the rss feeds.
In today’s run it grabs 310 total deals from 13 cities (there are a total of 20 cities listed, but 7 of them are listed as “coming soon”).
Yields this output: