I am trying to create a crawler that crawls the first 100 pages of a website.
My code is something like this:
    import urllib2
    from bs4 import BeautifulSoup  # assumes BeautifulSoup 4 (bs4)

    # Fetch one listing page and print its parsed HTML
    def extractproducts(pagenumber):
        contenturl = "http://websiteurl/page/" + str(pagenumber)
        content = BeautifulSoup(urllib2.urlopen(contenturl).read())
        print content

    pagenumberlist = range(1, 101)
    for pagenumber in pagenumberlist:
        extractproducts(pagenumber)
How do I go about using the threading module in this situation so that urllib will crawl X URLs at a time using multiple threads?
/newb out
Most likely, you want to use multiprocessing. There’s a Pool you can use to execute multiple things in parallel. If your function returns anything, Pool.map will return a list of the return values.
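Here is a minimal sketch of the Pool.map idea, not a drop-in crawler. To keep it runnable without network access, extract_products only builds and returns the page URL; a real version would download and parse the page at that point. The names build_url, crawl and workers are made up for this example. It also uses multiprocessing.dummy.Pool, which has the same interface as multiprocessing.Pool but is backed by threads, since the question asked for threads and the work is mostly waiting on the network.

```python
from multiprocessing.dummy import Pool  # same API as multiprocessing.Pool, but uses threads

BASE_URL = "http://websiteurl/page/"  # placeholder URL taken from the question


def build_url(page_number):
    # Hypothetical helper: build the URL for one listing page
    return BASE_URL + str(page_number)


def extract_products(page_number):
    # Stand-in for the real work: a real crawler would fetch and parse
    # the page here; this sketch just returns the URL it would request
    return build_url(page_number)


def crawl(page_numbers, workers=4):
    # Run extract_products over all page numbers, `workers` at a time.
    # Pool.map returns the results in the same order as the inputs.
    pool = Pool(workers)
    try:
        return pool.map(extract_products, page_numbers)
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    results = crawl(range(1, 101), workers=10)
    print(len(results))
```

The workers argument plays the role of "X URLs at a time". Changing the import to `from multiprocessing import Pool` would run the same code in separate processes instead of threads; in that case the `if __name__ == "__main__":` guard matters on platforms that start workers by re-importing the script.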