I’m writing a Python web crawler and I want to make it multi-threaded. Now

Question

0

Asked: June 17, 20262026-06-17T13:47:07+00:00 2026-06-17T13:47:07+00:00

I’m writing a Python web crawler and I want to make it multi-threaded. Now

0

I’m writing a Python web crawler and I want to make it multi-threaded. Now I have finished the basic part, below is what it does:

a thread gets a url from the queue;
the thread extracts the links from the page, checks if the links exist in a pool (a set), and puts the new links to the queue and the pool;
the thread writes the url and the http response to a csv file.

But when I run the crawler, it always gets stuck eventually, not exiting properly. I have gone through the official document of Python but still have no clue.

Below is the code:

#!/usr/bin/env python
#!coding=utf-8

import requests, re, urlparse
import threading
from Queue import Queue
from bs4 import BeautifulSoup

#custom modules and files
from setting import config


class Page:

    def __init__(self, url):

        self.url = url
        self.status = ""
        self.rawdata = ""
        self.error = False

        r = ""

        try:
            r = requests.get(self.url, headers={'User-Agent': 'random spider'})
        except requests.exceptions.RequestException as e:
            self.status = e
            self.error = True
        else:
            if not r.history:
                self.status = r.status_code
            else:
                self.status = r.history[0]

        self.rawdata = r

    def outlinks(self):

        self.outlinks = []

        #links, contains URL, anchor text, nofollow
        raw = self.rawdata.text.lower()
        soup = BeautifulSoup(raw)
        outlinks = soup.find_all('a', href=True)

        for link in outlinks:
            d = {"follow":"yes"}
            d['url'] = urlparse.urljoin(self.url, link.get('href'))
            d['anchortext'] = link.text
            if link.get('rel'):
                if "nofollow" in link.get('rel'):
                    d["follow"] = "no"
            if d not in self.outlinks:
                self.outlinks.append(d)


pool = Queue()
exist = set()
thread_num = 10
lock = threading.Lock()
output = open("final.csv", "a")

#the domain is the start point
domain = config["domain"]
pool.put(domain)
exist.add(domain)


def crawl():

    while True:

        p = Page(pool.get())

        #write data to output file
        lock.acquire()
        output.write(p.url+" "+str(p.status)+"\n")
        print "%s crawls %s" % (threading.currentThread().getName(), p.url)
        lock.release()

        if not p.error:
            p.outlinks()
            outlinks = p.outlinks
            if urlparse.urlparse(p.url)[1] == urlparse.urlparse(domain)[1] :
                for link in outlinks:
                    if link['url'] not in exist:
                        lock.acquire()
                        pool.put(link['url'])
                        exist.add(link['url'])
                        lock.release()
        pool.task_done()            


for i in range(thread_num):
    t = threading.Thread(target = crawl)
    t.start()

pool.join()
output.close()

Any help would be appreciated!

Thanks

Marcus

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-17T13:47:08+00:00

Your crawl function has an infinite while loop with no possible exit path.
The condition True always evaluates to True and the loop continues, as you say,

not exiting properly

Modify the crawl function’s while loop to include a condition. For instance, when the number of links saved to the csv file exceeds a certain minimum number, then exit the while loop.

i.e.,

def crawl():
    while len(exist) <= min_links:
        ...

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I’m writing a Python web crawler and I want to make it multi-threaded. Now

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply