I’m currently trying to gather some data from PolitiFact using Simple HTML DOM, but much of the time I get weird output instead of the expected HTML.
The goal is not to brute-force the site but to request it once or twice a day and cache the result.
Here is what most of the responses look like:
‹������í]{wÛ6²ÿ»=g¿ªn#»1EËJœÄ–µ×vœ&ÙÄñÚn²{r{|( ’S$ÇeuÛï~3न‡c'ÛísNÄ`f0˜Úß=}sxþ¯“#1ŠÆŽ8ùùàÕ‹CQ3Ló]ëÐ4Ÿž?ÿ|~þú•h66Åy`¹¡Ùžk9¦yt\µQù;¦9™L“...
And here’s the very simple code:
$html = file_get_html('http://www.politifact.com/personalities/barack-obama');
print_r($html->plaintext);
Do you have any idea why?
Is it some sort of protection or redirection on the website’s side?
Thank you very much!
You received the expected page, but gzip-compressed. It looks like the server doesn’t care that the Accept-Encoding header is missing from the request: instead of defaulting to a plain-text response, it sends gzipped data anyway.
I don’t think simple-html-dom can decompress the data, but you can use cURL for that purpose:
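A minimal sketch of that approach, assuming the PHP cURL and zlib extensions are installed. The function names fetch_decoded and ungzip_if_needed are hypothetical helpers made up for this illustration; they are not part of simple-html-dom. Setting CURLOPT_ENCODING to an empty string asks cURL to advertise every encoding it supports and to decompress the response itself; the gzip magic-byte check is only a fallback in case the body still arrives compressed.

```php
<?php
// Hypothetical fallback helper: decompress the body only if it still looks gzipped.
function ungzip_if_needed(string $body): string
{
    // gzip streams start with the magic bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        $decoded = gzdecode($body);
        if ($decoded !== false) {
            return $decoded;
        }
    }
    return $body;
}

// Hypothetical helper: fetch a URL with cURL and return the decoded HTML.
function fetch_decoded(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects, if any
    curl_setopt($ch, CURLOPT_ENCODING, '');         // '' = accept all supported encodings, auto-decode

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return ungzip_if_needed($body);
}

// Usage with simple-html-dom (assumes the library is already included):
// $html = str_get_html(fetch_decoded('http://www.politifact.com/personalities/barack-obama'));
// print_r($html->plaintext);
```

Passing the decoded string to str_get_html instead of calling file_get_html keeps the parsing code unchanged while moving the download, and the decompression, to cURL.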