I have a basic loop that looks for links on a page I have retrieved with urllib2.urlopen, but I only want to follow internal links.
Any ideas how to make the loop below pick up only links that are on the same domain?
for tag in soupan.findAll('a', attrs={'href': re.compile("^http://")}):
    webpage = urllib2.urlopen(tag['href']).read()
    print 'Deep crawl ----> ' + str(tag['href'])
    try:
        code-to-look-for-some-data...
    except Exception, e:
        print e
If the host is the same, or the host is empty (which is the case for relative paths), the URL belongs to the same host.
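A minimal sketch of that host check, assuming the URL of the page being crawled is available as a string. The name is_internal and its base_url parameter are made up for illustration, and the extra scheme test (to skip things like mailto: links, which also have an empty host) is my own addition rather than part of the rule above. The import falls back to the Python 2 urlparse module to match the urllib2 code in the question.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2, as used alongside urllib2


def is_internal(href, base_url):
    """Return True if href has no host, or the same host as base_url.

    is_internal and base_url are hypothetical names for this sketch.
    """
    parts = urlparse(href)
    # Assumption: only scheme-less or http(s) links are worth following,
    # so mailto:, javascript: and similar are rejected up front.
    if parts.scheme not in ("", "http", "https"):
        return False
    # Empty host means a relative path; otherwise the hosts must match.
    return parts.netloc == "" or parts.netloc == urlparse(base_url).netloc
```

Note that this compares hosts literally, so www.example.com and example.com would count as different hosts; whether that is what you want depends on the site.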
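Because you filter the tags with re.compile("^http://"), no relative paths will be used.
Relative links have no scheme or host at all, only a path, so that pattern never matches them.
Maybe do not use re at all?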
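One way to drop the regex could look like the sketch below: take every href, resolve it against the page URL with urljoin so relative paths become full URLs that urlopen can fetch, and keep only those whose host matches. The function name internal_links and the parameters hrefs and page_url are assumptions for the example, not anything from the question.

```python
try:
    from urllib.parse import urljoin, urlparse  # Python 3
except ImportError:
    from urlparse import urljoin, urlparse  # Python 2


def internal_links(hrefs, page_url):
    """Resolve each href against page_url and keep those on the same host.

    internal_links, hrefs and page_url are hypothetical names for this sketch.
    """
    page_host = urlparse(page_url).netloc
    result = []
    for href in hrefs:
        # urljoin turns relative paths into absolute URLs and leaves
        # already-absolute URLs unchanged.
        absolute = urljoin(page_url, href)
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == page_host:
            result.append(absolute)
    return result
```

In the loop from the question, the list of hrefs could probably be built with something like [tag['href'] for tag in soupan.findAll('a', href=True)], and each URL returned here passed to urllib2.urlopen as before.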