I’m working on a crawler project and I’m stuck: the same href values found on one page keep repeating on the other pages under that domain.
For example, if the URL is example.com, then the href values on its pages are hrefList=[/hello/world,/aboutus,/blog,/contact].
So the URLs for these pages would be
example.com/hello/world
example.com/aboutus
etc
Now on the page example.com/hello/world, the same hrefList is present again, so I get URLs such as
example.com/hello/world/hello/world,
example.com/hello/world/aboutus etc
Of these pages, /hello/world/hello/world is a proper page with HTTP status 200, and this keeps happening recursively. The rest of the pages return page not found and can be discarded.
As a result, I’m getting a list of new URLs that are not correct. Is there any way to overcome this?
This is my code:
for url in allUrls:
    if url not in visitedUrls:
        visitedUrls.append(url)
        http=httplib2.Http()
        response,content=http.request(url,headers={'User-Agent':'Crawler-Project'})
        if (response.status/100<4):
            soup=BeautifulSoup(content)
            links=soup.findAll('a',href=True)
            for link in links:
                if link.has_key('href'):
                    if len(link['href']) > 1:
                        if not any(x in link['href'] for x in ignoreUrls):
                            if link['href'][0]!="#":
                                if "http" in link["href"]:
                                    allUrls.append(link["href"])
                                else:
                                    if url[-1]=="/" and link['href'][0]=="/":
                                        allUrls.append(url+link['href'][1:])
                                    else:
                                        if not (url[-1] =="/" or link['href'][0] =="/"):
                                            allUrls.append(url+"/"+link['href'])
                                        else:
                                            allUrls.append(url+link['href'])
If we assume that the pages you get are identical, a possible workaround is to hash each page and make sure that you don’t crawl two pages with the same hash.
What you hash determines how robust and how resource-intensive this heuristic is. You could hash the whole page content, or some combination of its content, headers, and the links found by your crawler (or anything else that is unique enough per page other than its URL). Including the page’s URL in the hash is not a good idea, because your problem right now is that those pages have different URLs but the same content (with invalid links).
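As a rough illustration of the hashing idea, here is a minimal sketch in Python 3 using only the standard hashlib module. The names page_fingerprint, is_new_page and seen_fingerprints are made up for this example and do not come from your code. It assumes the page body is available as bytes and that the links are the raw href strings your crawler extracted. Note that it only helps if the repeated pages are byte-for-byte identical; a page that embeds a timestamp or a session token would hash differently on every request.

```python
import hashlib

# Fingerprints of every page crawled so far (hypothetical module-level state).
seen_fingerprints = set()


def page_fingerprint(content, hrefs=()):
    """Hash the page body plus its links; the URL is deliberately left out."""
    digest = hashlib.sha256()
    digest.update(content)
    # Sort and de-duplicate the links so their order on the page does not matter.
    for href in sorted(set(hrefs)):
        digest.update(b"\n")
        digest.update(href.encode("utf-8"))
    return digest.hexdigest()


def is_new_page(content, hrefs=()):
    """Return True the first time a page is seen, False for any later duplicate."""
    fingerprint = page_fingerprint(content, hrefs)
    if fingerprint in seen_fingerprints:
        return False
    seen_fingerprints.add(fingerprint)
    return True
```

In your loop you would call something like is_new_page(content, [link['href'] for link in links]) right after parsing the page, and only append new URLs to allUrls when it returns True. Hashing just the content is cheaper; adding the links makes the check a bit stricter.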
While you can, you shouldn’t have to implement workarounds for websites that are not built correctly. It will be a never-ending story.