This is driving me nuts! A little piece of code that I can’t seem to debug 🙁 Basically, I have an HTML file in a string, and I want to grab everything from one occurrence of X up to the next occurrence of X (same value) if there is one; if there isn’t, I want to grab everything from X to the end of the file.
The code that doesn’t work:
$contents = '< div id="main" class="clearfix"> < div id="col-1">< div id="content">< div id="p19601634">< h1>< span id="ppt19601634">';
$regex = '!<div id="content">(.*?)(?:<div id="content">)!s';
preg_match_all($regex, $contents, $matches);
Please note that I added spaces before the DIV for display purposes only, and that the match also needs to work across new lines and tabs inside the HTML (basically, there is a line break after the first DIV).
Right now, my code works if it finds several occurrences of my search string, and it returns the text between them. But if only one occurrence is found, it doesn’t work.
Does anyone know how to fix this?
Thanks a bunch
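For reference, the pattern in the question fails on a single occurrence because the trailing group requires a second `<div id="content">` to be present. One possible sketch, if you stay with `preg_match_all`, is to replace that group with a lookahead that accepts either the next marker or the end of the string (`\z`). The function name `sections_after_marker` and the sample strings are made up for illustration, and this only does the literal "from X to the next X" split; it knows nothing about HTML nesting.

```php
<?php
// Hypothetical helper: returns the text following each occurrence of $marker,
// up to the next occurrence of $marker or, failing that, the end of the input.
function sections_after_marker(string $html, string $marker): array
{
    // Escape the marker so it is matched literally; '!' is the pattern delimiter.
    $quoted = preg_quote($marker, '!');

    // (.*?)      lazy capture; the "s" flag lets "." match new lines as well
    // (?=...|\z) lookahead: stop before the next marker OR at end of input,
    //            without consuming the next marker so it can start a new match
    $regex = '!' . $quoted . '(.*?)(?=' . $quoted . '|\z)!s';

    preg_match_all($regex, $html, $matches);
    return $matches[1];
}

// Made-up sample input with a single occurrence, a line break and a tab.
$single = "<div id=\"main\">\n<div id=\"content\">\n\t<h1>Title</h1>";
$parts = sections_after_marker($single, '<div id="content">');
```

With a single marker, `$parts` holds one entry containing everything after the marker; with several markers, it holds one entry per marker.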
Regular expressions are not and never will be the right tool for this job. “I have to use regular expressions” is not true. There is computer science theory to explain this: regular expressions are only capable of matching regular languages, but HTML (or XML) allows arbitrarily deep nesting of elements, which makes it a more sophisticated language than that.
Another solution for you besides DOM mentioned in @meder’s answer is XSLTProcessor. XSLT is a declarative pattern-matching language like regular expressions. But XSLT is capable of matching the hierarchical structure of XHTML or XML.
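As a rough illustration of the DOM route mentioned above, here is a minimal sketch using PHP's bundled `DOMDocument` and `DOMXPath` classes. The helper name `content_divs` and the sample markup are assumptions made for this example, not code from the question. Note that it returns each whole `<div id="content">` element, which is not quite the same as "everything up to the next marker".

```php
<?php
// Hypothetical helper: returns the outer HTML of every <div> with the given id.
// Assumes $id contains no double quotes, since it is pasted into the XPath query.
function content_divs(string $html, string $id = 'content'): array
{
    $doc = new DOMDocument();

    // Scraped HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $result = [];
    foreach ($xpath->query('//div[@id="' . $id . '"]') as $node) {
        // saveHTML() with a node argument serializes just that subtree.
        $result[] = $doc->saveHTML($node);
    }
    return $result;
}

// Made-up sample input, with a line break and a tab inside the target DIV.
$html = "<div id=\"main\" class=\"clearfix\">\n"
      . "<div id=\"col-1\"><div id=\"content\">\n\t<h1><span>Title</span></h1>\n</div></div></div>";
$found = content_divs($html);
```

Because the parser tracks nesting, new lines, tabs, and inner DIVs do not need any special handling here.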
See the answers in Simple XML parsing on PHP for more solutions, including an example of XSLTProcessor in my answer.
If you want to learn all about HTML scraping techniques in PHP, there’s a book on the subject by Matthew Turland, titled php|architect’s Guide to Web Scraping with PHP. It’s available in digital form now, and should be in print soon.
If you can pry yourself away from PHP for a moment, try a Python package called Beautiful Soup. This package has one huge advantage: unlike strict DOM/XSLT parsers, Beautiful Soup doesn’t choke if you direct it to parse an HTML page that has some bad markup. Since most web sites you will be scraping probably contain some mistakes, this is a pretty important advantage.