I am trying to fetch data from a website and show it to the user, using cURL and the Simple HTML DOM PHP class.
Some pages redirect depending on the client’s language, so I am using a function to determine the final page that is to be scraped.
To show the page as the user would see it, I pass along the user’s own user agent:
$useragent = $_SERVER['HTTP_USER_AGENT'];
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
At the moment most of my users are Spanish speakers, so I am temporarily limiting the accepted languages: if the target page has a language redirect, it should prefer Spanish first and English second.
$header[] = "Accept-Language: es-es,es;q=0.8,en-us;q=0.5,en;q=0.3";
However, since my server is located in the Netherlands and some pages have an IP-based redirector, sometimes the pages redirect to the /nl/ directory, ignoring the language parameters.
This happens, for example, with http://www.econsultancy.com.
Is it possible to avoid this kind of redirect, maybe using the client’s IP address in the cURL request?
Also, is it possible to use the client’s browser language settings to make the Accept-Language parameter dynamic?
Here’s the entire function script:
<?php
// Returns the target of the last Location header sent for $originalurl,
// or null when the response contains no redirect.
function redirector($originalurl) {
    $ch = curl_init();
    $useragent = $_SERVER['HTTP_USER_AGENT'];
    $header = array();
    $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: es-es,es;q=0.8,en-us;q=0.5,en;q=0.3";
    $header[] = "Pragma: ";
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
    curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
    curl_setopt($ch, CURLOPT_HEADER, true);          // include response headers in the output
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the output instead of printing it
    curl_setopt($ch, CURLOPT_URL, $originalurl);
    $out = curl_exec($ch);
    if ($out === false) {
        return null; // request failed
    }
    // Keep only the header block (everything before the first blank line).
    $out = str_replace("\r", "", $out);
    $headers_end = strpos($out, "\n\n");
    if ($headers_end !== false) {
        $out = substr($out, 0, $headers_end);
    }
    $targeturl = null; // stays null when no Location header is present
    foreach (explode("\n", $out) as $line) {
        // Header names are case-insensitive, so compare without regard to case.
        if (stripos($line, "Location:") === 0) {
            $targeturl = trim(substr($line, strlen("Location:")));
        }
    }
    return $targeturl;
}
?>
Thanks in advance!
Some IP-based redirects are really stubborn (and it’s almost impossible to switch certain pages to English from <whatever the page thinks your language is>), but you may try to intercept any redirect by setting CURLOPT_FOLLOWLOCATION to false and parsing the Location header yourself (this solution requires you to guess the URL correctly).
Edit - per site
If you can afford to do this on a “per site” basis (creating a function for each site to switch the language), you can trace what happens when you switch languages (for example, Firefox has a perfect plugin for this), and most of the time you’ll end up using:
- /nl/, lang=nl, l=nl, … in the URL
- POST (username and password)
With a little luck you’ll be fine with what you already have, combined with a “large array” of cookie name/value pairs.
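A minimal sketch of such a cookie array, assuming made-up cookie names (lang, language, locale) that are only guesses and not taken from any particular site. The helper buildCookieHeader is hypothetical; it just joins the pairs into the value of a Cookie request header.

```php
<?php
// Hypothetical helper: turn name => value pairs into a Cookie header value.
function buildCookieHeader(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        // URL-encode the value so "; " and "=" cannot break the header format.
        $pairs[] = $name . '=' . rawurlencode($value);
    }
    return implode('; ', $pairs);
}

// Assumed cookie names; real sites may use entirely different ones.
$languageCookies = [
    'lang'     => 'es',
    'language' => 'es',
    'locale'   => 'es_ES',
];

$cookieHeader = buildCookieHeader($languageCookies);
```

The resulting string can then be handed to the existing handle with curl_setopt($ch, CURLOPT_COOKIE, $cookieHeader). Sites that do not know a given cookie name will normally just ignore it, which is why one shared array can work for a while.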
But once you encounter two pages which use the same cookie variable name with different values, you’re screwed and you’ll have to use some sort of switch($domain) again.
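Roughly, such a switch could be combined with the redirect interception mentioned at the top of this answer. This is only a sketch under assumptions: the hosts site-a.example and site-b.example, the cookie name lang and its values are placeholders, and the function names languageCookiesFor, extractLocation and interceptRedirect are made up for illustration.

```php
<?php
// Hypothetical per-domain table: same cookie name, different value formats.
function languageCookiesFor(string $url): array
{
    $host = strtolower((string) parse_url($url, PHP_URL_HOST));
    switch ($host) {
        case 'site-a.example':
            return ['lang' => 'es'];
        case 'site-b.example':
            return ['lang' => 'es_ES'];
        default:
            return []; // unknown site: send no language cookie
    }
}

// Return the value of the last Location header in a raw header block, or null.
function extractLocation(string $rawHeaders): ?string
{
    $location = null;
    foreach (preg_split('/\r?\n/', $rawHeaders) as $line) {
        // Header names are case-insensitive.
        if (stripos($line, 'Location:') === 0) {
            $location = trim(substr($line, strlen('Location:')));
        }
    }
    return $location;
}

// Request one URL without following redirects and report where it points.
function interceptRedirect(string $url): ?string
{
    $pairs = [];
    foreach (languageCookiesFor($url) as $name => $value) {
        $pairs[] = $name . '=' . rawurlencode($value);
    }

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false); // do not follow redirects
    curl_setopt($ch, CURLOPT_HEADER, true);          // include headers in the output
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return instead of printing
    if ($pairs) {
        curl_setopt($ch, CURLOPT_COOKIE, implode('; ', $pairs));
    }

    $raw = curl_exec($ch);
    return $raw === false ? null : extractLocation($raw);
}
```

If interceptRedirect() returns something like /nl/, you still have to guess the URL you actually want (for example by swapping the language segment) and request that one instead.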