I’m making a Python script that verifies if a Wikipedia link chain is valid.

Question

Asked: June 8, 2026

I’m making a Python script that verifies if a Wikipedia link chain is valid. For instance, the chain

List of jōyō kanji > Elementary schools in Japan > Education > Knowledge

is a valid one since you can reach each page only by clicking links.
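
Put as code, a chain is valid when every page links to the page that follows it. Here is a minimal sketch of that check. The names `chain_is_valid`, `has_link`, and `FAKE_LINKS` are hypothetical, and the link table is only a stand-in built from the example chain so the snippet runs without touching the network; a real `has_link` would fetch the page and look for the link.

```python
from typing import Callable, Sequence


def chain_is_valid(chain: Sequence[str],
                   has_link: Callable[[str, str], bool]) -> bool:
    """Return True if every page in the chain links to the next one.

    all() stops at the first broken hop, so later pages are never checked.
    """
    return all(has_link(src, dst) for src, dst in zip(chain, chain[1:]))


# Stand-in data (hypothetical): which pages each page links to.
FAKE_LINKS = {
    "List of jōyō kanji": {"Elementary schools in Japan"},
    "Elementary schools in Japan": {"Education"},
    "Education": {"Knowledge"},
}


def fake_has_link(src: str, dst: str) -> bool:
    # A real implementation would download `src` and search it for `dst`.
    return dst in FAKE_LINKS.get(src, set())


chain = ["List of jōyō kanji", "Elementary schools in Japan",
         "Education", "Knowledge"]
valid = chain_is_valid(chain, fake_has_link)
reversed_valid = chain_is_valid(chain[::-1], fake_has_link)
```

Because the check stops at the first missing link, an invalid chain costs at most one failed page lookup beyond the valid prefix.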

The issue is that these pages are really long, and downloading each entire page, checking whether the link is in it, and repeating those steps for every hop will take a long time. The chains could be longer, too.

So what I want to know is whether I can use urllib2 (or any other library) to download each page and tell it to stop when needed, or whether this would just put more load on the CPU and make things worse.

1 Answer

Editorial Team · Answer 1 · 2026-06-08T09:32:30+00:00

I couldn’t find a way of doing this with urllib2, but there’s one obvious solution using raw sockets:

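import socket

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def found(text, data):
    # Return True if text (bytes) occurs in data (bytes).
    return text in data


def page_contains(url, text):
    # Plain HTTP on port 80 only; an https:// URL would also need the
    # socket wrapped with the ssl module.
    parsed_url = urlparse(url)
    host = parsed_url.netloc
    path = parsed_url.path or '/'
    port = 80

    web = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        web.connect((host, port))
    except socket.error:
        web.close()
        return False

    # HTTP header lines end in CRLF. "Connection: close" makes the server
    # end the stream after the response instead of keeping it open.
    request = ('GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n'
               % (path, host))
    web.sendall(request.encode('ascii'))

    buf = b''
    hit = False
    while not hit:
        data = web.recv(2048)
        if not data:
            break  # server closed the connection: page ended without a match
        # Keep the tail of the previous chunk so a match that straddles
        # two recv() calls is not missed.
        buf = buf[-(len(text) - 1):] + data if len(text) > 1 else data
        hit = found(text, buf)

    # Closing here abandons the rest of the page.
    web.close()
    return hit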

This way you stop downloading once you find the relevant data and avoid downloading unnecessary content from large web pages.
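
The same early-stop idea can also be sketched against any file-like object, which is what `urllib.request.urlopen` returns on Python 3 (the successor to urllib2). The helper below is a hypothetical illustration, not part of any library: `stream_contains` and its parameters are made-up names, and the demo feeds it an in-memory `io.BytesIO` instead of a real HTTP response so it runs offline.

```python
import io


def stream_contains(stream, needle: bytes, chunk_size: int = 2048) -> bool:
    """Read `stream` chunk by chunk; return True as soon as `needle` appears."""
    if not needle:
        return True
    overlap = len(needle) - 1  # bytes to carry over between chunks
    tail = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return False  # end of stream, no match
        window = tail + chunk
        if needle in window:
            return True  # stop here; the rest is never read
        # Keep just enough of the end to catch a match split across chunks.
        tail = window[-overlap:] if overlap else b""


# Offline demo: the needle sits at the start of a large body.
demo = io.BytesIO(b"<a href=x>" + b"." * 100000)
demo_found = stream_contains(demo, b"href", chunk_size=16)
demo_bytes_read = demo.tell()  # position after the early stop

# With a real page (assumption: the server accepts the request), usage
# would look like:
#   from urllib.request import urlopen
#   with urlopen(url) as resp:
#       stream_contains(resp, needle)
```

Leaving the `with` block closes the connection, so the remainder of the page is not requested from the stream, although the operating system may already have buffered a little more than was read.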
