This is a bit of a weird one. I’m using HttpClient 4.1.2, and it seems that whenever it encounters a URL containing a ‘#’, it sends the GET request with the # and everything after it still in the URL.
For example, requesting the URL http://stks.co/eWt redirects to the URL http://news.ichinastock.com/2011/10/jack-ma-alibaba-has-prepared-20-billion-to-acquire-yahoo/#.Tpw-xG61XjU.twitter. Now this URL is live, but the problem is that HttpClient sends a GET request with the URI set to URI: /2011/10/jack-ma-alibaba-has-prepared-20-billion-to-acquire-yahoo/#.Tpw-xG61XjU.twitter, which causes the server to send back a 404 page not found.
Looking at the GET sent by IE, Firefox and cURL, they all strip out the #… from the end of the URI, so for example the cURL GET request URI is set as URI: /2011/10/jack-ma-alibaba-has-prepared-20-billion-to-acquire-yahoo/ – all the #… have been removed. This is for the exact same entry URL of http://stks.co/eWt.
As a test, sending this raw URL into HTTPClient (i.e. HttpGet httpget = new HttpGet("http://news.ichinastock.com/2011/10/jack-ma-alibaba-has-prepared-20-billion-to-acquire-yahoo/#.Tpw-xG61XjU.twitter");) gives the same 404 not found result.
So the question is: are there any settings in HttpClient that would automatically remove things like the trailing #… from URLs? Or, failing that, how would I go about removing it manually (remembering that I would need to catch all redirect URLs as well)?
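For the manual route, a minimal sketch of what stripping the trailing #… could look like, using only the JDK. The class and method names (`FragmentStripper`, `stripFragment`) are made up for illustration and are not part of HttpClient. The same helper would have to be applied to every redirect target as well, and how to hook that into HttpClient 4.1.2's redirect handling is not shown here.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not an HttpClient API: removes the fragment
// from a URL string before the URL is used to build a request.
public class FragmentStripper {

    /** Returns the URL without the first '#' and everything after it. */
    public static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    public static void main(String[] args) throws URISyntaxException {
        String redirected = "http://news.ichinastock.com/2011/10/"
                + "jack-ma-alibaba-has-prepared-20-billion-to-acquire-yahoo/"
                + "#.Tpw-xG61XjU.twitter";

        // This cleaned URI is what would be handed to the HTTP client.
        URI requestUri = new URI(stripFragment(redirected));
        System.out.println(requestUri);
    }
}
```

Cutting at the first # should be safe, since a literal # is only allowed in a URI as the fragment delimiter and has to be percent-encoded anywhere else.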
It sounds like their web server is broken. The URI specification says that a number sign (#) terminates the path portion of the URI. If a web server considers anything after a # part of the path, it is not following the URI specification.
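To illustrate where the path ends, here is a small sketch using the JDK's `java.net.URI` parser on the URL from the question. The class name `UriParts` and the helper `pathAndFragment` are hypothetical. It only shows how a spec-following parser splits the URI, and says nothing about what this particular server does.

```java
import java.net.URI;

// Hypothetical demo class: shows how java.net.URI separates
// the path from the fragment at the number sign.
public class UriParts {

    /** Returns {path, fragment} as parsed by java.net.URI. */
    public static String[] pathAndFragment(String url) {
        URI uri = URI.create(url);
        return new String[] { uri.getPath(), uri.getFragment() };
    }

    public static void main(String[] args) {
        String[] parts = pathAndFragment(
                "http://news.ichinastock.com/2011/10/"
                + "jack-ma-alibaba-has-prepared-20-billion-to-acquire-yahoo/"
                + "#.Tpw-xG61XjU.twitter");

        // The path stops at the '#'; the rest is reported as the fragment.
        System.out.println("path:     " + parts[0]);
        System.out.println("fragment: " + parts[1]);
    }
}
```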
I tested a few popular web servers, and they all parse these URIs correctly, ignoring the portion after the number sign.
I don’t have any good suggestions for a workaround though. But at least now you know who to blame.