Hello, I am trying to write a little spider.
While building it, I ran into a problem: I need to check whether a link is a root domain link or a subdomain link.
For example:
http://www.domain.com or
http://domain.com
http://domain.com/index.php
http://domain.com/default.php
http://domain.com/index.html
http://domain.com/default.html
and so on are all the same.
So what I actually need is a function that takes a URL string as input and checks whether it is the root (or homepage, whatever you like to call it) of a site.
As noted in the comments, this is really a basic aspect of coding a spider. If you intend to code a general-purpose spider, you’ll need to add a means of resolving URLs and detecting whether they point to the same content and in what way (through a redirect or simply through duplicate content), as well as what kind of content they point to.
You need at least to handle:
These are just some of the aspects, but it all comes down to this: the kind of detection you’re after has to be a fundamental part of the spider if you intend to use it in any kind of generic manner.
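For the purely syntactic part of the check, a minimal sketch in Python might look like the following. It only inspects the URL string and cannot see redirects or duplicate content, so treat it as a first filter rather than a full answer. The names `is_homepage`, `site_key` and `DEFAULT_DOCUMENTS` are made up for illustration, the list of default document names is an assumption taken from the examples in the question, and treating `www.domain.com` and `domain.com` as the same site is also an assumption that does not hold for every server.

```python
from urllib.parse import urlsplit

# Assumed default document names, taken from the question's examples.
# Real servers can be configured with other names.
DEFAULT_DOCUMENTS = {"index.php", "default.php", "index.html", "default.html"}


def site_key(url):
    """Return the lowercased host with a leading 'www.' removed.

    Assumption: 'www.domain.com' and 'domain.com' serve the same site.
    """
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def is_homepage(url, default_documents=DEFAULT_DOCUMENTS):
    """Guess from the URL string alone whether it points at a site's root."""
    parts = urlsplit(url)
    # Only http(s) URLs with a host can be a site's homepage.
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    # A query string usually selects different content, so be conservative.
    if parts.query:
        return False
    path = parts.path.lower()
    # An empty path or a bare slash is the root.
    if path in ("", "/"):
        return True
    # Otherwise accept only a single top-level default document.
    return path.count("/") == 1 and path.lstrip("/") in default_documents
```

With this sketch, `is_homepage("http://domain.com/index.php")` is true while `is_homepage("http://domain.com/blog/index.php")` is false, and `site_key` gives the same value for `http://www.domain.com` and `http://domain.com`. Anything beyond that (redirects, identical content under different URLs) needs an actual request and a comparison of the responses.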