I am writing a scraper and I have the following code:
//Open link prepended with domain
$link='http://www.domain.de/'.$link;
$data=@file_get_contents($link);
$regex='#<span id="bandinfo">(.+?)<br><img src=".*?" title=".*?" alt=".*?" > (.+?) (.+?)<br>(.+?)<br><a href=".*?">Mail-Formular</a> <img onmouseover=".*?" onmouseout=".*?" onclick=".*?" style=".*?" src=".*?" alt=".*?"> <br><a href="tracklink.php.*?>(.+?)</a></span>#';
preg_match_all($regex,$data,$match2);
foreach($match2[1] as $info) echo $info."<br/>";
As you can see, I need to capture several things in the regex. However, when I echo the result at the bottom, it only ever gives me the first captured group.
I thought the array would contain all of the captured groups? I need to save them in variables, but I do not know how to access them.
You should not use regex to parse HTML. Here's a simple function I've put together that uses DOMDocument, plus cURL, as it's faster.
Example scrape:
Looking for all links (a) that have an onmouseout attribute with a value of return nd();:
Or a second example, looking for a div with a class attribute called bandinfo:
Or an image contained within an onclick in some JavaScript:
Get all img tags with onclicks
The actual DOM function:
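A minimal sketch of what such a DOM helper might look like, assuming a hypothetical function named findElements() (not the original author's code) built on PHP's DOMDocument and DOMXPath. The sample HTML is made up so that the three example queries above each have something to match:

```php
<?php
// Hypothetical helper: runs an XPath query against an HTML string and returns
// plain arrays describing each matched element.
function findElements(string $html, string $xpath): array
{
    $doc = new DOMDocument();
    // Scraped HTML is rarely valid, so collect parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($doc))->query($xpath);
    if ($nodes === false) {
        return []; // malformed XPath expression
    }

    $found = [];
    foreach ($nodes as $node) {
        $attributes = [];
        foreach ($node->attributes ?? [] as $attr) {
            $attributes[$attr->name] = $attr->value;
        }
        $found[] = [
            'tag'        => $node->nodeName,
            'text'       => trim($node->textContent),
            'attributes' => $attributes,
        ];
    }
    return $found;
}

// Made-up sample markup, loosely modelled on the page in the question.
$html = '<span id="bandinfo">Some Band<br>'
      . '<a href="#" onmouseout="return nd();">Mail-Formular</a> '
      . '<img onclick="show(1)" src="a.png" alt="">'
      . '<a href="tracklink.php?id=1">http://example.com</a></span>'
      . '<div class="bandinfo">Genre: Rock</div>';

// All links that have an onmouseout attribute with the value "return nd();"
$links = findElements($html, '//a[@onmouseout="return nd();"]');

// A div with a class attribute called bandinfo
$divs = findElements($html, '//div[@class="bandinfo"]');

// All img tags that carry an onclick attribute
$images = findElements($html, '//img[@onclick]');

foreach ($links as $link) {
    echo $link['text'], "\n";
}
```

Each result is an ordinary array, so every piece of data can be stored in its own variable, for example $images[0]['attributes']['onclick'].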
And the curl function:
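A minimal sketch of a cURL fetch helper, assuming a hypothetical function named fetchUrl(). The name, the timeout default and the chosen options are illustrative assumptions rather than the original author's code:

```php
<?php
// Hypothetical helper: fetches a URL with cURL and returns the response body,
// or null if the request fails.
function fetchUrl(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }

    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds, // limit time spent connecting
        CURLOPT_TIMEOUT        => $timeoutSeconds, // limit the whole transfer
    ]);

    $body = curl_exec($ch);

    return $body === false ? null : $body;
}
```

It could stand in for the @file_get_contents($link) call in the question, for example $data = fetchUrl($link);, where a null result signals a failed request instead of a silently suppressed warning.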
Hope it helps.