I’m currently building a new online feed reader in PHP. One of the features I’m working on is feed auto-discovery: if a user enters a website URL, the script will detect that it’s not a feed and look for the real feed URL by parsing the HTML for the proper <link> tag.
The problem is that the way I’m currently detecting whether the URL is a feed or a website only works part of the time, and I know it can’t be the best solution. Right now I’m taking the cURL response and running it through simplexml_load_string; if it can’t be parsed, I treat it as a website. Here is the code:
$xml = @simplexml_load_string( $site_found['content'] );
if( !$xml ) // this is a website, not a feed
{
// handle website
}
else
{
// parse feed
}
Obviously, this isn’t ideal. Also, when it runs into an HTML website that it can parse (a well-formed XHTML page, for example), it thinks it’s a feed.
Any suggestions on a good way of telling a feed from a non-feed in PHP?
I would sniff for the various unique identifiers those formats have:
Atom: Source
RSS 0.90: Source
Netscape RSS 0.91
etc. etc. (See the 2nd source link for a full overview).
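As a rough sketch of that kind of sniffing, and not a definitive implementation: the function below loads the content with DOMDocument and looks at the root element's name and namespace. The name sniff_feed_format and its return values are made up for illustration. It assumes Atom 1.0 uses the http://www.w3.org/2005/Atom namespace and that RSS 0.90 and 1.0 use an rdf:RDF root; older or more exotic variants would need extra cases.

```php
<?php
// Hypothetical helper: map a document's root element to a feed format.
// Returns 'atom', 'rss', 'rdf', or null when the root is not a known feed root.
function sniff_feed_format(string $content): ?string
{
    $xml = trim($content);
    if ($xml === '') {
        return null; // nothing to inspect
    }

    // Collect libxml errors quietly instead of using the @ operator.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadXML($xml, LIBXML_NONET);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded || $doc->documentElement === null) {
        return null; // not well-formed XML, so probably plain HTML
    }

    $root = $doc->documentElement;
    $name = strtolower($root->localName);
    $ns   = (string) $root->namespaceURI;

    if ($name === 'feed' && $ns === 'http://www.w3.org/2005/Atom') {
        return 'atom'; // Atom 1.0 (assumed namespace)
    }
    if ($name === 'rss') {
        return 'rss'; // RSS 0.91 through 2.0 share the <rss> root
    }
    if ($name === 'rdf' && $ns === 'http://www.w3.org/1999/02/22-rdf-syntax-ns#') {
        return 'rdf'; // RSS 0.90 and 1.0 use an rdf:RDF root
    }
    return null; // e.g. an XHTML page that happens to parse as XML
}
```

With something like this, a well-formed XHTML page comes back as null instead of being mistaken for a feed, because its root element is html rather than feed, rss or rdf:RDF.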
As far as I can see, separating Atom and RSS should be pretty easy by looking for <feed> and <rss> tags, respectively. Plus you won’t find those in a valid HTML document.
You could make an initial check to tell HTML and feeds apart by looking for <html> and <body> elements first. To avoid problems with invalid input, this may be a case where using regular expressions (over a parser) is finally justified for once 🙂
If it doesn’t match the HTML test, run the Atom / RSS tests on it. If it is not recognized as a feed, or the XML parser chokes on invalid input, fall back to HTML again.
What that looks like in the wild, and whether feed providers always conform to those rules, is a different question, but you should already be able to recognize a lot this way.
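The HTML-first, then Atom/RSS, then fall-back-to-HTML order could be sketched with a single regular expression, roughly like this. The function name classify_response and the 'html' / 'feed' return values are hypothetical, and treating rdf:RDF as a feed root is an assumption added to cover RSS 0.90 and 1.0. It only inspects the first element's tag name, so a comment containing markup before the root element could still fool it.

```php
<?php
// Hypothetical helper: classify a response body as 'html' or 'feed'
// by looking only at the tag name of the first element.
function classify_response(string $body): string
{
    // The first "<" followed by a letter skips the XML declaration,
    // the doctype and comments, which start with "<" plus "?" or "!".
    // Group 1 is an optional namespace prefix, group 2 the local name.
    if (!preg_match('/<([a-z][\w.-]*:)?([a-z][\w.-]*)/i', $body, $m)) {
        return 'html'; // no tag at all: treat it as a website
    }
    $name = strtolower($m[2]);

    // 1. HTML test first.
    if (in_array($name, ['html', 'body'], true)) {
        return 'html';
    }

    // 2. Atom / RSS tests: <feed>, <rss>, or <rdf:RDF>.
    if (in_array($name, ['feed', 'rss', 'rdf'], true)) {
        return 'feed';
    }

    // 3. Not recognized as a feed: fall back to HTML.
    return 'html';
}
```

Anything classified as 'feed' would still go through the XML parser afterwards; if the parser chokes on it, the fallback is the same HTML handling.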