I’m trying to retrieve this page using Apache HttpClient: http://quick-dish.tablespoon.com/
Unfortunately, all I get back is the following (this is JSoup’s output, so the response is probably just the HTTP… string itself):
<html>
<head></head>
<body>
HTTP/1.1 200 OK [Server: nginx/1.0.11, Content-Type: text/html;charset=UTF-8, Last-Modified: Mon, 02 Jul 2012 15:30:40 GMT, Vary: Accept-Encoding, Cookie,Accept-Encoding, X-Powered-By: PHP/5.3.6, X-Pingback: http://quick-dish.tablespoon.com/xmlrpc.php, X-Powered-By: ASP.NET, Content-Encoding: gzip, X-Blz: lb1.blaze.io, Date: Mon, 02 Jul 2012 16:06:21 GMT, Content-Length: 11723, Connection: keep-alive]
</body>
</html>
Here is my code (note that I’m emulating Googlebot, as I’ve found that web servers tend to be better behaved that way):
URL sourceURL = new URL("http://quick-dish.tablespoon.com/");
HttpClient httpClient = new ContentEncodingHttpClient();
httpClient.getParams().setBooleanParameter("http.protocol.handle-redirects", true);
final HttpGet httpget = new HttpGet(sourceURL.toURI());
httpget.setHeader("User-Agent", "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)");
httpget.setHeader("Accept", "text/html");
httpget.setHeader("Accept-Charset", "utf-8");
final HttpResponse response = httpClient.execute(httpget);
return Jsoup.parse(response.toString());
Needless to say, the page returns fine in my web browser. Any ideas?
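Instead of calling toString() on the response, you need to get the response entity with response.getEntity(). response.toString() only gives you the status line and the headers, which is exactly the text JSoup ended up wrapping in <html> and <body>.
Then you can read the contents of that entity, for example with EntityUtils.toString(entity), and pass the resulting HTML string to Jsoup.parse().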
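A minimal sketch of that fix, assuming HttpClient 4.x and jsoup are on the classpath as in the question. The class name ResponseBody and its parse method are hypothetical helpers made up for illustration; the UTF-8 fallback is also an assumption, used only when the entity does not declare a charset itself.

```java
import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper (not part of HttpClient or jsoup).
final class ResponseBody {

    private ResponseBody() {
    }

    // Reads the body of the response and hands the HTML to jsoup.
    static Document parse(HttpResponse response, String baseUri) throws IOException {
        // The entity carries the body; response.toString() only has the status line and headers.
        HttpEntity entity = response.getEntity();
        if (entity == null) {
            // Some responses legitimately have no body.
            throw new IOException("No response body: " + response.getStatusLine());
        }
        // Consumes the entity's content stream and decodes it to a String.
        // "UTF-8" is only a fallback (an assumption) if the entity names no charset.
        String html = EntityUtils.toString(entity, "UTF-8");
        // The base URI lets jsoup resolve relative links in the page.
        return Jsoup.parse(html, baseUri);
    }
}
```

In the question's code, the last line would then become something like return ResponseBody.parse(response, sourceURL.toString()); in place of Jsoup.parse(response.toString()). Since the response came from a ContentEncodingHttpClient, the gzip-encoded body should already be decompressed by the time the entity is read.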