I have made a crawler, but I can't work out how to go through the pagination. Can someone please help me with this? Thanks.
Here is my crawler script:
if (!$fp = fopen("https://market.android.com/details?id=apps_topselling_paid&cat=LIBRARIES_AND_DEMO&start=0&num=24", "r")) {
    return false;
}
// Read the listing page into a string.
$content = "";
while (!feof($fp)) {
    $content .= fgets($fp, 1024);
}
fclose($fp);
if (!preg_match('/error-section/i', $content)) {
    preg_match_all("/id=([^/i", $content, $matches, PREG_SET_ORDER);
    $i = 1;
    foreach ($matches as $val) {
        $link = $val[1];
        // Fetch the detail page of each app found on the listing.
        if (!$fps = fopen("https://market.android.com/details?id=" . $link, "r")) {
            return false;
        }
        $content_app = "";
        while (!feof($fps)) {
            $content_app .= fgets($fps, 1024);
        }
        fclose($fps);
        preg_match("/([^/i", $content_app, $regs);
        echo $regs[1] . "<br />";
    }
} else {
    echo 'Error page not found!';
}
I assume that the pagination is something similar to comment pagination on blogs.
One way is to find the link to the next page, and follow that link. It can be done quite easily with a regex.
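A minimal sketch of the "follow the next link" approach could look like the following. It assumes the site marks its next-page link with rel="next", which may not match the real markup, so the pattern would need adjusting. The names findNextPageUrl and crawlByNextLink are made up for illustration, and $fetch is any function that returns the HTML of a URL or false on failure.

```php
<?php
// Hypothetical helper: returns the href of the first <a rel="next"> link, or null.
// Assumption: the "next page" link carries rel="next"; adapt the pattern to the real page.
function findNextPageUrl(string $html): ?string
{
    // Grab the whole opening tag of an anchor that has rel="next".
    if (!preg_match('/<a\b[^>]*\brel=["\']next["\'][^>]*>/i', $html, $tag)) {
        return null;
    }
    // Extract the href attribute from that tag.
    if (!preg_match('/\bhref=["\']([^"\']+)["\']/i', $tag[0], $href)) {
        return null;
    }
    // Decode entities such as &amp; so the result is usable as a URL.
    return html_entity_decode($href[1]);
}

// Follows next links until none is found, a fetch fails, or the page cap is reached.
// Relative links are not resolved here; that is left out to keep the sketch short.
function crawlByNextLink(string $startUrl, callable $fetch, int $maxPages = 50): array
{
    $pages = [];
    $url = $startUrl;
    while ($url !== null && count($pages) < $maxPages) {
        $html = $fetch($url);
        if ($html === false) {
            break;
        }
        $pages[] = $html;
        $url = findNextPageUrl($html);
    }
    return $pages;
}
```

For a real crawl you could pass 'file_get_contents' as $fetch. The page cap is only there so that a link pointing back to an earlier page cannot loop forever.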
Another way, if you are crawling a single site, is to figure out the URL structure of its pagination, and then just request pages incrementally until there are no more comments.
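The incremental approach might be sketched like this, using the start and num parameters from the URL in the question. Two things are assumptions here: that start advances by num from one page to the next, and that a page past the end contains the same error-section marker the original script checks for. The names listingUrl and crawlByOffset are hypothetical, and $fetch is any function that returns the HTML of a URL or false on failure.

```php
<?php
// Builds the listing URL for one page; base URL and parameter names are taken from the question.
function listingUrl(int $start, int $num = 24): string
{
    return 'https://market.android.com/details?id=apps_topselling_paid&cat=LIBRARIES_AND_DEMO'
        . '&start=' . $start . '&num=' . $num;
}

// Requests page after page by increasing start, and stops when a fetch fails,
// the error marker shows up, or the page cap is reached.
function crawlByOffset(callable $fetch, int $num = 24, int $maxPages = 50): array
{
    $pages = [];
    for ($page = 0; $page < $maxPages; $page++) {
        $html = $fetch(listingUrl($page * $num, $num));
        if ($html === false || preg_match('/error-section/i', $html)) {
            break;
        }
        $pages[] = $html;
    }
    return $pages;
}
```

Each returned page can then go through the same link extraction as the first page in the original script, for example with 'file_get_contents' passed as $fetch.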