I use XPath to parse an HTML webpage and fetch all internal links. DOMXPath returns every link found in an href attribute. How can I separate internal and external links?
I introduced a series of string checks to remove external links, but the problem is that there are different ways to link to internal pages, such as:
page.html
/page.html
http://domain.com/page.html
http://subdomain.domain.com/page.html
....
What is the safest way to distinguish internal links (any link to the present domain, including its subdomains) from external links (to any other domain)?
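Use substr() to check whether the first 4 characters of the link are "http".
If so, use the parse_url() function to check whether the host is your domain or one of its subdomains.
If the link does not start with "http", it is a relative link and therefore internal.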
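The steps above could be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the function name `isInternalLink` and the `$baseDomain` parameter (the site's own domain, e.g. "domain.com") are assumptions made for the example, and the subdomain check is a simple suffix comparison on the host.

```php
<?php
// Hypothetical helper: $baseDomain is assumed to be the site's own domain, e.g. "domain.com".
function isInternalLink(string $href, string $baseDomain): bool
{
    // Links that do not start with "http" are treated as relative, hence internal.
    if (strtolower(substr($href, 0, 4)) !== 'http') {
        return true;
    }

    // Absolute link: extract the host and compare it with our domain.
    $host = parse_url($href, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // no usable host, so do not treat it as internal
    }

    $host = strtolower($host);
    $baseDomain = strtolower($baseDomain);
    $suffix = '.' . $baseDomain;

    // Internal if the host is the domain itself or ends with ".domain.com" (a subdomain).
    return $host === $baseDomain || substr($host, -strlen($suffix)) === $suffix;
}
```

With this sketch, `isInternalLink('http://subdomain.domain.com/page.html', 'domain.com')` is treated as internal, while a host such as `notdomain.com` is not, because the suffix comparison includes the leading dot. Note that the "starts with http" rule also classifies links such as `mailto:` or protocol-relative `//other.com/page.html` as internal, so you may want to filter those separately.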