Working on a small web spider in python, using the lxml module I have

Asked: May 17, 2026 (2026-05-17T21:06:52+00:00)

I'm working on a small web spider in Python using the lxml module. I have a segment of code that runs an XPath query on the document and places all the links from 'a href' tags into a list. What I'd like to do is check each link as it is added to the list and unescape it if needed. I understand how to use the urllib.unquote() function, but it throws an exception, which I believe happens because not every link passed to it needs unescaping. Can anyone point me in the right direction? Here's the code I have so far:

import urllib
import urllib2
from lxml.html import parse, tostring

class Crawler():

    def __init__(self, url):
        self.url = url
        self.links = []
    def crawl(self):

        doc = parse("http://" + self.url).getroot()
        doc.make_links_absolute(self.url, resolve_base_href=True)
        for tag in doc.xpath("//a"):
            old = tag.get('href')
            fixed = urllib.unquote(old)
            self.links.append(fixed)
        print(self.links)

1 Answer

Editorial Team · Answer 1 · 2026-05-17T21:06:53+00:00

urllib.unquote() doesn't throw an exception when a URL doesn't need unescaping; it returns such a string unchanged. You haven't shown us the exception, but my guess is that old isn't a string. It's probably None, because you have an <a> tag with no href attribute.

Check the value of old before you try to use it.
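
The check described above could look something like the following. This is only a minimal sketch, not a drop-in fix: the helper name unquote_hrefs and the sample URLs are made up for illustration, and it works on a plain list of href values so that it runs without lxml. The import fallback is there because urllib.unquote from the question lives at urllib.parse.unquote in Python 3.

```python
try:
    from urllib.parse import unquote  # Python 3 location
except ImportError:
    from urllib import unquote  # Python 2, as used in the question


def unquote_hrefs(hrefs):
    """Return unquoted copies of the href values, skipping missing ones.

    Hypothetical helper for illustration; not part of lxml or urllib.
    """
    fixed = []
    for href in hrefs:
        # tag.get('href') returns None for an <a> tag with no href attribute
        if href is None:
            continue
        # unquote leaves strings without %-escapes unchanged
        fixed.append(unquote(href))
    return fixed


# Made-up values standing in for tag.get('href') results
sample = ["http://example.com/a%20b", None, "http://example.com/plain"]
print(unquote_hrefs(sample))
# prints ['http://example.com/a b', 'http://example.com/plain']
```

In the crawl() method from the question, the same guard would presumably go right after old = tag.get('href'). Another option might be to change the XPath query to "//a[@href]", which selects only the anchors that actually have an href attribute.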
