I am trying to write a script to crawl my site.
But I am stuck at the “if statement” on line 15; the comparison never seems to succeed.
I guess it’s an encoding problem, or the string contains other characters.
The document encoding is ANSI and the website uses ISO-8859-15.
HParser.py:
from HTMLParser import HTMLParser
from htmlentitydefs import name2codepoint
import urllib2
url = 'http://DOMAIN.TLD'
req = urllib2.Request(url)
response = urllib2.urlopen(req)
the_page = response.read()
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        tag = unicode(tag)
        tag = tag.strip()
        print "'",tag,"'"
        if tag == 'a':
            for attr in attrs:
                if 'src' == attr[0]:
                    print 'Link: ', attr[1]
    def handle_endtag(self, tag):
        pass
    def handle_data(self, data):
        pass
    def handle_comment(self, data):
        pass
    def handle_entityref(self, name):
        pass
    def handle_charref(self, name):
        pass
    def handle_decl(self, data):
        pass
parser = MyHTMLParser()
parser.feed(the_page)
I tested your code a little using the Stack Overflow main page as the url. Here is what I found:
1) tag == 'a' evaluates to True correctly when the tag is ‘a’.
2) attr prints out as a tuple, as you expected.
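To illustrate the second point, here is a minimal sketch of what the parser passes in as attrs. The class name AttrDumper and the sample HTML string are made up for illustration and are not taken from the real page; the import fallback is only there so the snippet can run on Python 3 as well as on the Python 2 used in the question.

```python
try:
    from html.parser import HTMLParser  # Python 3 name of the module
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as in the question


class AttrDumper(HTMLParser):
    """Hypothetical helper: records every start tag with its attrs list."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples, one per attribute
        self.seen.append((tag, attrs))


# Made-up sample markup, not the real page
dumper = AttrDumper()
dumper.feed('<a href="/questions">Questions</a><img src="logo.png" alt="logo">')

for tag, attrs in dumper.seen:
    print("%s %r" % (tag, attrs))
# a [('href', '/questions')]
# img [('src', 'logo.png'), ('alt', 'logo')]
```

In this sample the ‘a’ tag carries ‘href’ and only the ‘img’ tag carries ‘src’.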
So I think this means that you just never get a tuple whose first element is ‘src’. When I parsed the main Stack Overflow page, I didn’t get any attr tuple with attr[0] being ‘src’ either.
In short, the problem is with the if condition on line 18, the one that compares attr[0] with 'src'.
Now, I don’t know HTML well enough to know if the ‘src’ attribute ever goes with the <a> tag, but I usually see ‘src’ with the <img> tag, and ‘href’ with the <a> tag. So you may want to change line 18 to if attr[0] == 'href' instead.
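The suggested change can be sketched as follows. This is only an illustration under a few assumptions: the class name LinkParser, the links list, and the sample markup are hypothetical, and the links are collected into a list instead of printed so the result is easy to inspect. The import fallback lets it run on Python 2 or Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3 name of the module
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as in the question


class LinkParser(HTMLParser):
    """Hypothetical variant of MyHTMLParser that checks 'href' on <a> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # Compare against 'href' rather than 'src'
            if name == 'href' and value is not None:
                self.links.append(value)


# Made-up sample markup standing in for the downloaded page
sample = ('<a href="/about">About</a>'
          '<img src="logo.png">'
          '<a name="top">Top</a>'
          '<a href="http://DOMAIN.TLD/contact">Contact</a>')

link_parser = LinkParser()
link_parser.feed(sample)
links = link_parser.links

for link in links:
    print("Link: %s" % link)
```

The <img> tag and the <a> tag without ‘href’ are skipped, so only the two real links end up in the list.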