I have a basic loop to look for links on a page i have


Asked: June 3, 2026 (2026-06-03T09:43:33+00:00)


I have a basic loop that looks for links on a page I retrieved with urllib2.urlopen, but I only want to follow internal links.

How can I make the loop below pick up only links that are on the same domain?

for tag in soupan.findAll('a', attrs={'href': re.compile("^http://")}):
    webpage = urllib2.urlopen(tag['href']).read()
    print 'Deep crawl ----> ' + str(tag['href'])
    try:
        code-to-look-for-some-data...

    except Exception, e:
        print e


1 Answer

Editorial Team · Answer 1 · 2026-06-03T09:43:35+00:00

>>> import urllib
>>> print urllib.splithost.__doc__
splithost('//host[:port]/path') --> 'host[:port]', '/path'.

If the host is the same, or there is no host at all (which is the case for relative paths), the URL belongs to the same host.
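The rule above can be sketched as a small helper. This is only an illustration: the name is_same_host and the sample URLs are made up, and it uses urlparse (urllib.parse on Python 3) rather than splithost. Note that it treats every href without a host as internal, which also includes things like mailto: links.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

def is_same_host(href, crawl_host):
    """Hypothetical helper: True if href has no host or its host equals crawl_host."""
    # netloc is 'host[:port]' for absolute URLs and '' for relative ones
    host = urlparse(href).netloc.lower()
    return host == '' or host == crawl_host.lower()

print(is_same_host('http://www.xxx.de/3/4/5', 'www.xxx.de'))    # absolute, same host
print(is_same_host('/folder/file.htm', 'www.xxx.de'))           # relative path
print(is_same_host('http://other.example/page', 'www.xxx.de'))  # different host
```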

# theHostToCrawl must hold the lower-case host name of the site being crawled, e.g. 'www.xxx.de'
for tag in soupan.findAll('a', attrs={'href': re.compile("^http://")}):
    href = tag['href']
    protocol, url = urllib.splittype(href)  # 'http://www.xxx.de/3/4/5' => ('http', '//www.xxx.de/3/4/5')
    host, path = urllib.splithost(url)      # '//www.xxx.de/3/4/5' => ('www.xxx.de', '/3/4/5')
    # splithost gives None as the host for relative paths, so skip only links that name a different host
    if host and host.lower() != theHostToCrawl:
        continue

    webpage = urllib2.urlopen(href).read()

    print 'Deep crawl ----> ' + str(href)
    try:
        code-to-look-for-some-data...

    except:
        import traceback
        traceback.print_exc()

Because you filter the links with

'href': re.compile("^http://")

relative paths are never matched.
They look like this:

<a href="/folder/file.htm"></a>

Maybe drop the re filter altogether, so that relative links are found too?
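If the regex filter is dropped, relative hrefs have to be turned into absolute URLs before they can be fetched. A minimal sketch of that, assuming the hrefs have already been collected into a list; the function name internal_links and the sample URLs are hypothetical, and urljoin/urlparse come from urllib.parse on Python 3 or urlparse on Python 2:

```python
try:
    from urllib.parse import urljoin, urlparse  # Python 3
except ImportError:
    from urlparse import urljoin, urlparse      # Python 2

def internal_links(page_url, hrefs):
    """Hypothetical helper: absolute http(s) URLs from hrefs that share page_url's host."""
    page_host = urlparse(page_url).netloc.lower()
    result = []
    for href in hrefs:
        # Resolve relative paths such as '/folder/file.htm' against the page URL
        absolute = urljoin(page_url, href)
        parts = urlparse(absolute)
        # Keep only web links on the same host (drops mailto:, other sites, ...)
        if parts.scheme in ('http', 'https') and parts.netloc.lower() == page_host:
            result.append(absolute)
    return result

hrefs = ['/folder/file.htm', 'http://www.xxx.de/3/4/5',
         'http://other.example/', 'mailto:someone@xxx.de']
print(internal_links('http://www.xxx.de/index.htm', hrefs))
```

In the crawler, the hrefs could probably be gathered with something like [tag['href'] for tag in soupan.findAll('a', href=True)], and each returned URL passed to urllib2.urlopen.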
