I have this code which gets the HTML source of a page:
$page = file_get_contents('http://example.com/page.html');
$page = htmlentities($page);
I want to scrape some content from it. For example, say the page’s source contains this:
<strong>technorati.com</strong><br />
Connection failed<br /><br />Pinging <strong>icerocket.com</strong><br />
Connection failed<br /><br />Pinging <strong>weblogs.com</strong><br />
Done<br /><br />Pinging <strong>newsgator.com</strong><br />
Done<br /><br />Pinging <strong>blo.gs</strong><br />
Done<br /><br />Pinging <strong>feedburner.com</strong><br />
Done<br /><br />Pinging <strong>blogstreet.com</strong><br />
Done<br /><br />Pinging <strong>my.yahoo.com</strong><br />
Connection failed<br /><br />Pinging <strong>moreover.com</strong><br />
Connection failed<br /><br />Pinging <strong>newsisfree.com</strong><br />
Done<br />
Is there a way I could scrape this from the source and store it in a variable, so it’ll look like this:
technorati.com Connection failed
icerocket.com Connection failed
weblogs.com Done
Etc.
Of course, the page is dynamic, which is why I’m having a problem. Could I maybe search for each site name in the source? But then how would I get the result that comes after it (Connection failed / Done)?
Thanks a lot for the help!
I have scraped multiple sites using the Simple HTML DOM library for PHP, which can be obtained here: http://simplehtmldom.sourceforge.net/
Then using code like this:
This results in something like:
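For the ping output shown in the question, a minimal sketch of the extraction could look like the following. It is only an illustration under a few assumptions: it uses PHP's built-in preg_match_all rather than Simple HTML DOM so that it runs without extra files, it expects the raw HTML (that is, before htmlentities() is applied, since that call turns the tags into entities), and the function name extractPingResults and the shortened $sample string are hypothetical.

```php
<?php
// Hypothetical helper: maps each pinged site to the status text that follows it.
// Assumes raw HTML in the layout shown in the question:
//   <strong>site</strong><br /> status<br />
function extractPingResults(string $html): array
{
    $results = [];
    // Group 1: site name inside <strong>; group 2: text up to the next <br>.
    $pattern = '~<strong>\s*([^<]+?)\s*</strong>\s*<br\s*/?>\s*([^<]+?)\s*<br~i';
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $results[$match[1]] = $match[2];
        }
    }
    return $results;
}

// Sample input copied from the question (shortened).
$sample = <<<'HTML'
<strong>technorati.com</strong><br />
Connection failed<br /><br />Pinging <strong>icerocket.com</strong><br />
Connection failed<br /><br />Pinging <strong>weblogs.com</strong><br />
Done<br />
HTML;

$results = extractPingResults($sample);
foreach ($results as $site => $status) {
    echo $site . ' ' . $status . PHP_EOL;
}
// Prints:
// technorati.com Connection failed
// icerocket.com Connection failed
// weblogs.com Done
```

Because the pattern keys on the <strong> tag and the text before the next <br>, it should not matter which sites the dynamic page lists, as long as the layout stays the same. If the markup is less regular than this, a real HTML parser is likely the safer choice.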