When scraping a page such as http://baidu.com, my script doesn’t follow the <meta http-equiv="refresh"> redirect. The code I’m running:
require_once 'HTTP/Request2.php';

$request = new HTTP_Request2("http://baidu.com", HTTP_Request2::METHOD_GET);
$request->setConfig(array(
    'adapter' => 'HTTP_Request2_Adapter_Curl',
    'connect_timeout' => 15,
    'timeout' => 30,
    'follow_redirects' => TRUE,
    'max_redirects' => 10,
));

// Initialise so the final print is safe when the request fails.
$html = '';
try {
    $response = $request->send();
    if (200 == $response->getStatus()) {
        $html = $response->getBody();
    } else {
        echo 'Unexpected HTTP status: ' . $response->getStatus() . ' ' .
            $response->getReasonPhrase();
    }
} catch (HTTP_Request2_Exception $e) {
    echo 'Error: ' . $e->getMessage();
}
print $html;
This outputs:
<html>
<meta http-equiv="refresh" content="0;url=http://www.baidu.com/">
</html>
Is there a way to make it follow this redirect, to get proper html in $response->getBody()?
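The PEAR library does follow HTTP redirects, because those are signalled in the response headers (a 3xx status with a Location header). What your question shows is an HTML meta refresh, which is a different mechanism: it lives in the response body, so the HTTP client does not treat it as a redirect.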
What you’ll want to do is read the body of the response to the first request made via PEAR, parse the “meta refresh” tag, and then make a second request to the URL you extracted from it.
An example of a function that does this can be found in a comment left on the PHP manual page for get_meta_tags: http://php.net/manual/en/function.get-meta-tags.php
If PEAR is your preferred way of making HTTP calls, you may want to re-implement the getURLContents function from that snippet so that it uses PEAR to fetch the first URL.
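For illustration, here is a minimal sketch of that approach built on HTTP_Request2. It is not the snippet from the PHP manual. The function names extractMetaRefreshUrl and fetchFollowingMetaRefresh and the $maxHops limit are made up for this example, the regular expressions only cover the common content="N;url=..." form, and the refresh target is assumed to be an absolute URL.

```php
<?php
/**
 * Hypothetical helper: return the target URL of the first
 * <meta http-equiv="refresh"> tag in $html, or null if there is none.
 */
function extractMetaRefreshUrl(string $html): ?string
{
    // Collect every <meta ...> tag, then inspect each one's attributes.
    if (!preg_match_all('/<meta\b[^>]*>/i', $html, $tags)) {
        return null;
    }
    foreach ($tags[0] as $tag) {
        if (!preg_match('/http-equiv\s*=\s*["\']?refresh\b/i', $tag)) {
            continue;
        }
        // The content attribute looks like "0;url=http://www.baidu.com/".
        if (preg_match('/content\s*=\s*(["\'])\s*[\d.]*\s*[;,]\s*url\s*=\s*(.*?)\1/i', $tag, $m)) {
            return trim($m[2], " \t'\"");
        }
    }
    return null;
}

/**
 * Hypothetical helper: fetch $url with HTTP_Request2 and keep following
 * meta refresh tags (assumed to hold absolute URLs) for up to $maxHops hops.
 */
function fetchFollowingMetaRefresh(string $url, int $maxHops = 5): string
{
    // Loaded here so the parser above can be used without PEAR installed.
    require_once 'HTTP/Request2.php';

    for ($hop = 0; $hop <= $maxHops; $hop++) {
        $request = new HTTP_Request2($url, HTTP_Request2::METHOD_GET);
        $request->setConfig(array(
            'adapter'          => 'HTTP_Request2_Adapter_Curl',
            'follow_redirects' => true,
            'max_redirects'    => 10,
        ));
        $response = $request->send();
        if (200 != $response->getStatus()) {
            throw new RuntimeException('Unexpected HTTP status: ' . $response->getStatus());
        }
        $html = $response->getBody();
        $next = extractMetaRefreshUrl($html);
        if ($next === null) {
            return $html; // no meta refresh: this is the final page
        }
        $url = $next;
    }
    throw new RuntimeException('Too many meta refresh hops');
}
```

Calling fetchFollowingMetaRefresh('http://baidu.com') inside the same try/catch as in the question should then give you the body of the page the meta tag points to. The hop limit is there so that a page which refreshes to itself cannot cause an endless loop.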