I try to write a script to crawl my site. But I stuck to

Question

Asked: June 4, 2026 (2026-06-04T23:43:39+00:00)

I am trying to write a script to crawl my site,
but I am stuck at line 15, the if statement: the comparison never seems to succeed.
I think it is an encoding problem, or the tag contains other characters, but that is only a guess.
The document encoding is ANSI and the website is ISO-8859-15.

HParser.py:

from HTMLParser import HTMLParser
from htmlentitydefs import name2codepoint
import urllib2

url = 'http://DOMAIN.TLD'
req = urllib2.Request(url)
response = urllib2.urlopen(req)
the_page = response.read()

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        tag = unicode(tag)
        tag = tag.strip()
        print "'",tag,"'"
        if tag == 'a':
            for attr in attrs:
                if 'src' == attr[0]:
                    print 'Link: ', attr[1]

    def handle_endtag(self, tag):
        pass

    def handle_data(self, data):
        pass

    def handle_comment(self, data):
        pass

    def handle_entityref(self, name):
        pass

    def handle_charref(self, name):
        pass

    def handle_decl(self, data):
        pass

parser = MyHTMLParser()
parser.feed(the_page)

1 Answer

Editorial Team · Answer 1 · 2026-06-04T23:43:40+00:00

I tested your code briefly, using the Stack Overflow main page as the URL. Here is what I found:

1) tag == 'a' correctly evaluates to True when the tag is 'a'.

2) Each attr is a (name, value) tuple, as you would expect. For example:

('href', 'http://creativecommons.org/licenses/by-sa/3.0/')
('class', 'cc-wiki-link')

What I think this means is that you simply never get a tuple whose first element is 'src' on an <a> tag. When I parsed the Stack Overflow main page, I did not see any attr with attr[0] equal to 'src' inside an <a> tag either.
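If you want to check this against your own page, one option is to tally which attribute names actually show up on each tag. This is only a rough sketch and not what I ran above: the class name AttrNameCounter and the sample markup are made up for illustration, and the import fallback is there only so the same code runs on Python 2 (as in the question) and Python 3.

```python
from collections import Counter, defaultdict

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as in the question


class AttrNameCounter(HTMLParser):
    """Hypothetical helper: counts the attribute names seen on each tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.seen = defaultdict(Counter)

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        for name, _value in attrs:
            self.seen[tag][name] += 1


# Made-up sample markup standing in for the downloaded page.
sample = '<a href="/home" class="nav">Home</a><img src="/logo.png" alt="logo">'

counter = AttrNameCounter()
counter.feed(sample)
counter.close()

a_attr_names = sorted(counter.seen['a'])      # attribute names found on <a>
img_attr_names = sorted(counter.seen['img'])  # attribute names found on <img>
```

If 'src' never appears in a_attr_names for your real page, the inner comparison can never be true, no matter what the encoding is.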

In short, the problem is not the tag comparison but the inner if condition that tests for 'src' (if 'src' == attr[0]).

I don't know HTML well enough to say whether a src attribute ever appears on an <a> tag, but I usually see src on <img> tags and href on <a> tags. So you may want to change that condition to if attr[0] == 'href' instead.
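Put together, the change could look something like the sketch below. Treat it as an illustration under a few assumptions rather than a drop-in replacement: LinkCollector and the sample bytes are invented names and data, the links are collected in a list instead of printed, and the page is decoded as ISO-8859-15 only because that is the encoding you mention for your site.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as in the question


class LinkCollector(HTMLParser):
    """Hypothetical name: collects the href values of <a> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # <a> carries its target in href, not src.
            # value is None for an attribute written without a value.
            if name == 'href' and value is not None:
                self.links.append(value)


# Made-up bytes standing in for response.read(); byte 0xE9 is "é" in ISO-8859-15.
page_bytes = b'<a href="/caf\xe9">Menu</a><img src="/logo.png"><a name="top">Top</a>'

collector = LinkCollector()
# Decode with the site's encoding (assumed ISO-8859-15) before feeding the parser.
collector.feed(page_bytes.decode('iso-8859-15'))
collector.close()

links = collector.links
```

With your real page you would pass the_page instead of page_bytes. The <img> tag and the <a name="top"> anchor are skipped because neither has an href on an <a> tag.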
