I am trying to write a small PHP function that goes through a page (given a URL) and returns the total number of links and the number of links that link to the same page. For example, if I provide google.com as the URL, it should return how many links there are on google.com and how many of them link back to google.com (including, of course, http://www.google.com, google.com, google.com/#, etc.).
Is that easy to do, and how would I do it?
(This is NOT a homework question, so please provide as much help as possible.)
If you need more information about what I mean, just ask and I will provide it.
I’d suggest SimpleXml or DOM for this task, but they will choke on invalid markup, and unfortunately the majority of the web still uses invalid markup, including Google, which you mentioned in your question. You could fetch the HTML from these URLs and tidy it, but you can also use SimpleHTML.
Please note that I do not have SimpleHTML installed atm, so the above might not work out of the box. It should point you into the right direction though.
EDIT
Oh boy, did I really write this? Was I drunk or something? And why did no one complain about it? To correct myself:
DOM handles broken HTML fine if you use the loadHTML() method. SimpleXml doesn't. The suggested solution with SimpleHtmlDom will probably work, but IMO SimpleHtmlDom sucks. Better third-party libraries can be found in Best Methods to parse HTML.
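For what it's worth, here is a minimal sketch of the DOM approach with loadHTML(). It is not tested against real-world pages. The function name countLinks() and its return shape are made up for illustration, and it assumes "links to the same page" means "the link's host matches the page's host, ignoring a leading www."; relative links and bare fragments such as # count as pointing back, while mailto: or javascript: links do not.

```php
<?php
// Hypothetical helper (not from any library): counts <a href> links in an HTML
// string and how many of them point back to the same host as $pageUrl.
function countLinks(string $html, string $pageUrl): array
{
    // Assumption: "www.example.com" and "example.com" are the same site.
    $normalizeHost = function (string $host): string {
        return preg_replace('/^www\./', '', strtolower($host));
    };

    // Allow a bare "google.com" by assuming http:// when no scheme is given.
    if (!preg_match('#^https?://#i', $pageUrl)) {
        $pageUrl = 'http://' . $pageUrl;
    }
    $pageHost = $normalizeHost((string) parse_url($pageUrl, PHP_URL_HOST));

    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of printing them;
    // loadHTML() still builds a tree from invalid markup.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $total = 0;
    $same = 0;
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        if (!$anchor->hasAttribute('href')) {
            continue; // named anchors are not links
        }
        $total++;

        $parts = parse_url(trim($anchor->getAttribute('href')));
        if ($parts === false) {
            continue; // seriously malformed URL: counted, but not as "same"
        }
        $scheme = strtolower($parts['scheme'] ?? '');
        if ($scheme !== '' && $scheme !== 'http' && $scheme !== 'https') {
            continue; // mailto:, javascript:, etc.
        }
        $host = $parts['host'] ?? null;
        // No host means a relative URL or a fragment, which stays on the site.
        if ($host === null || $normalizeHost($host) === $pageHost) {
            $same++;
        }
    }

    return ['total' => $total, 'same' => $same];
}
```

To run it against a live page you would fetch the HTML first, for example with file_get_contents($url) (which needs allow_url_fopen enabled) or cURL, and pass the result in as $html. Subdomains such as mail.google.com are treated as a different host here; loosen the comparison if you want them counted too.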