I’ve got a list of websites for each US Congress member that I’m programmatically crawling to scrape addresses. The sites vary widely in their underlying markup, which wasn’t a problem until I noticed that hundreds of them were not giving the expected results for the script I had written.
After taking some more time to evaluate potential causes, I found that calling strip_tags() on the results of file_get_contents() was, in many cases, erasing most of the page source! It was not only removing the HTML, it was also removing the non-HTML text that I wanted to scrape!
So I removed the call to strip_tags(), substituted a call to remove all non-alphanumeric characters, and gave the process another run. It turned up more results, but many were still missing. This time it was because my regular expressions weren’t matching the desired patterns. After looking at the returned text, I realized that I had the remnants of HTML attributes interspersed throughout it, breaking my patterns.
Is there a way around this? Is it the result of malformed HTML? Can I do anything about it?
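There’s a warning in the PHP manual for strip_tags(): because the function does not actually validate the HTML, partial or broken tags can cause it to remove more text than expected.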
Since you are scraping many different sites, and you can’t account for the validity of their HTML, this will always be a problem. Unfortunately, regexps aren’t going to do it for you either, as regexps simply aren’t cut out to be document parsers.
I would use something like PHP Simple HTML DOM Parser, or even the built-in DOMDocument->loadHTML() method.
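As a rough illustration of the DOMDocument route, here is a minimal sketch that pulls the visible text out of a page without relying on strip_tags(). The function name extractVisibleText() and the sample markup are made up for this example, and it assumes the page is small enough to hold in memory; real pages may also need their character encoding handled before parsing.

```php
<?php
// Hypothetical helper (not from the original script): returns the text
// content of an HTML string, tolerating malformed markup.
function extractVisibleText(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them,
    // since real-world pages are frequently invalid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Drop <script> and <style> elements so their contents do not
    // end up mixed into the text.
    $xpath = new DOMXPath($doc);
    foreach (iterator_to_array($xpath->query('//script | //style')) as $node) {
        $node->parentNode->removeChild($node);
    }

    if ($doc->documentElement === null) {
        return '';
    }

    // Collapse runs of whitespace into single spaces.
    $text = preg_replace('/\s+/', ' ', $doc->documentElement->textContent);

    return trim((string) $text);
}

// Sample input with an unclosed <p>, purely for illustration.
$sample = '<div class="addr"><p>123 Main St</p> <p>Springfield, IL 62701</div>'
        . '<script>var x = 1;</script>';
echo extractVisibleText($sample), "\n";
```

Because the parser builds a tree first, attribute values never leak into the extracted text, which is what was breaking the regular expressions. Patterns for addresses can then be run against the cleaned text, or better, against specific nodes selected with XPath.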
You could keep a small database that records each page you want to scrape and where the information is found in the structure of that page. Each time you scrape a page, you could do a quick check to see whether the structure has changed; if it has, you update the database with the new path for your DOM parser and pick the data up on the next scrape.
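The stored-path idea could look something like the sketch below. It assumes the "path" is kept as an XPath expression per URL; the function name scrapeWithStoredPath(), the $paths array (standing in for a database table), and the example URL and markup are all hypothetical.

```php
<?php
// Hypothetical helper: applies a stored XPath expression to a page and
// returns the matched text, or null when the path no longer matches
// (a hint that the page structure has changed).
function scrapeWithStoredPath(string $html, string $xpathExpr): ?string
{
    if (trim($html) === '') {
        return null;
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($xpathExpr);

    if ($nodes === false || $nodes->length === 0) {
        return null; // structure changed, or the expression is invalid
    }

    $text = preg_replace('/\s+/', ' ', $nodes->item(0)->textContent);

    return trim((string) $text);
}

// Stand-in for the database: URL => XPath of the address block.
$paths = [
    'https://example.gov/contact' => '//div[@id="office"]//address',
];

// Stand-in for the fetched page body.
$html = '<div id="office"><address>123 Main St</address></div>';

foreach ($paths as $url => $expr) {
    $address = scrapeWithStoredPath($html, $expr);
    if ($address === null) {
        echo "Structure changed for $url; stored path needs review\n";
    } else {
        echo "$url: $address\n";
    }
}
```

A null result does not say where the address moved to, only that the recorded path is stale, so finding the new path would still be a manual or heuristic step.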