The Archive Base

Editorial Team
Asked: May 23, 2026

I’m writing a web scraper in Python, using httplib2 and lxml (yes, I know I could be using Scrapy. Let’s move past that…). The scraper has about 15,000 pages to parse into approximately 400,000 items. I’ve got the code that parses the items running (almost) instantaneously, but the portion that downloads each page from the server is still extremely slow. I’d like to overcome that through concurrency. However, I can’t rely on EVERY page needing to be parsed EVERY time.

I’ve tried a single ThreadPool (like multiprocessing.pool, but done with threads, which should be fine since this is an I/O-bound process), but I couldn’t think of a graceful (or working) way of getting ALL of the threads to stop when the date of the last index item was greater than the item we were processing.

Right now, I’m working on a method using two instances of ThreadPool: one to download each page, and another to parse the pages. A simplified code example is:

#! /usr/bin/env python2

import httplib2
from Queue import PriorityQueue
from multiprocessing.pool import ThreadPool
from lxml.html import fromstring

pages = [x for x in range(1000)]
page_queue = PriorityQueue(1000)

url = "http://www.google.com"

def get_page(page):
    #Grabs google.com
    h = httplib2.Http(".cache")
    resp, content = h.request(url, "GET")
    tree = fromstring(str(content), base_url=url)
    page_queue.put((page, tree))
    print page_queue.qsize()

def parse_page():
    page_num, page = page_queue.get()
    print "Parsing page #" + str(page_num)
    #do more stuff with the page here
    page_queue.task_done()

if __name__ == "__main__":
    collect_pool = ThreadPool()
    collect_pool.map_async(get_page, pages)
    collect_pool.close()

    parse_pool = ThreadPool()
    parse_pool.apply_async(parse_page)
    parse_pool.close()


    parse_pool.join()
    collect_pool.join()
    page_queue.join()

Running this code, however, doesn’t do what I expect, which is to fire off two thread pools: one populating a queue and another pulling from it to parse. It begins the collect pool and runs through it, and then begins the parse_pool and runs through it (I assume; I’ve not let the code run long enough to get to the parse_pool. The point is that collect_pool is all that seems to be running). I’m fairly sure I’ve messed something up with the order of the calls to join(), but I can’t for the life of me figure out what order they’re supposed to be in.
My question is essentially this: am I barking up the right tree here? If so, what the hell am I doing wrong? If not, what would your suggestions be?

1 Answer

    Editorial Team, added an answer on May 23, 2026 at 6:18 am

    First of all, your design seems to be correct at a high level. The use of a thread pool for collecting the pages is justified by the synchronous nature of the httplib2 module. (With an asynchronous library one thread would be enough; note that even with httplib2 and the pool, at most one collector thread executes Python code at any moment because of the GIL, although the GIL is released while a thread waits on the network, so several requests can still be in flight at once.) The parsing pool is justified by the lxml module having been written in C/C++ (assuming that the Global Interpreter Lock is therefore released while a page is parsed; this should be checked in the lxml docs or code!). If the latter were not true, there would be no performance gain from having a dedicated parsing pool, as only one thread at a time would be able to acquire the GIL. In that case it would be better to use a process pool.

    I am not familiar with the ThreadPool implementation, but I assume that it is analogous to the Pool class in the multiprocessing module. On this basis the problem appears to be that you create only a single work item for the parse_pool and after parse_page processes the first page it never tries to dequeue further pages from there. Additional work items are not submitted to this pool either, so the processing stops, and after the parse_pool.close() call the threads of the (empty) pool terminate.
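The single-work-item problem can be reproduced without any network access. The following is a minimal sketch, written for Python 3 (the question uses Python 2) and using a plain `Queue` pre-filled with integers as a stand-in for the downloaded pages; those stand-ins are assumptions for illustration, not part of the original code.

```python
from multiprocessing.pool import ThreadPool
from queue import Queue

# Stand-in for the downloaded pages: five items already waiting in the queue.
page_queue = Queue()
for n in range(5):
    page_queue.put(n)

parsed = []

def parse_page():
    # Dequeues exactly one item per call; there is no loop here.
    parsed.append(page_queue.get())
    page_queue.task_done()

pool = ThreadPool(2)
pool.apply_async(parse_page)  # one work item, so parse_page runs once
pool.close()
pool.join()

print(len(parsed), "parsed;", page_queue.qsize(), "still queued")
```

Only one item leaves the queue, no matter how many threads the pool has, because `apply_async` was called once and `parse_page` never loops.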

    The solution is to eliminate the page_queue. The get_page() function should put a work item on the parse_pool by calling apply_async() for every page it collects, instead of feeding them into page_queue.

    The main thread should wait until the collect_pool has finished (i.e. until the collect_pool.join() call has returned, which requires calling collect_pool.close() first), then close the parse_pool (as we can then be sure that no more work will be submitted to the parser). Finally, it should wait for the parse_pool to drain by calling parse_pool.join(), and then exit.
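The arrangement described above might look roughly like the sketch below. It is written for Python 3, and `fetch` is a hypothetical stand-in that returns fake HTML instead of calling httplib2, so the example runs offline; `parse_page` likewise only records the content length where the real code would call lxml. The pool sizes are arbitrary.

```python
import threading
from multiprocessing.pool import ThreadPool

results = {}
results_lock = threading.Lock()

def fetch(page):
    # Hypothetical stand-in for the httplib2 request; returns fake HTML.
    return "<html><body>page %d</body></html>" % page

def parse_page(page, content):
    # Stand-in for the lxml parsing step.
    with results_lock:
        results[page] = len(content)

def get_page(page):
    content = fetch(page)
    # Hand the page straight to the parser pool; no intermediate queue.
    parse_pool.apply_async(parse_page, (page, content))

collect_pool = ThreadPool(8)
parse_pool = ThreadPool(4)

collect_pool.map_async(get_page, range(20))
collect_pool.close()
collect_pool.join()  # every download finished, so every parse task is submitted

parse_pool.close()   # safe now: nothing else will submit work
parse_pool.join()    # wait for the parsers to drain

print(len(results), "pages parsed")
```

Because each `get_page` call submits its parse task before returning, `collect_pool.join()` returning implies that nothing more will arrive at the parse pool, which is what makes the later `parse_pool.close()` safe.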

    Furthermore, you need to increase the number of threads in the collect_pool in order to process more HTTP requests concurrently. The default number of threads in a pool is the number of CPUs, so currently you cannot issue more than that many requests at once. You may experiment with values up to thousands or tens of thousands; observe the CPU consumption of the pool, which should not approach one full CPU.
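The effect of the pool size can be seen with a small experiment. This is only a sketch, written for Python 3: `fake_download` is a hypothetical function that sleeps to imitate waiting on the network, and the thread counts are illustrative rather than recommended values.

```python
import time
from multiprocessing.pool import ThreadPool

def fake_download(page):
    # Imitates a blocking network call; sleeping releases the GIL.
    time.sleep(0.05)
    return page

def timed_run(num_threads, pages):
    # The first ThreadPool argument sets the number of worker threads.
    pool = ThreadPool(processes=num_threads)
    start = time.perf_counter()
    pool.map(fake_download, pages)
    pool.close()
    pool.join()
    return time.perf_counter() - start

pages = list(range(40))
slow = timed_run(2, pages)    # about 20 rounds of waiting
fast = timed_run(40, pages)   # about one round of waiting
print("2 threads: %.2fs, 40 threads: %.2fs" % (slow, fast))
```

With real requests the best value depends on the server and the network, so it is worth measuring rather than guessing.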
