When I fetch the index page of some random Spanish newspaper with WebRequest, the diacritics don’t come through properly; they show up as this weird character: �. When I download the response from the same URI with a WebClient, I get the appropriate response.
Why the difference?
var client = new WebClient();
string html = client.DownloadString(endpoint);
vs
WebRequest request = WebRequest.Create(endpoint);
using (WebResponse response = request.GetResponse())
{
    Stream stream = response.GetResponseStream();
    StreamReader reader = new StreamReader(stream);
    string html = reader.ReadToEnd();
}
By creating your StreamReader without explicitly setting the encoding, you’re assuming that the entity is in UTF-8, which is the StreamReader default. You should examine the CharacterSet of the HttpWebResponse (it is not exposed by the WebResponse base class) and open the StreamReader with the appropriate encoding.

Otherwise, if it reads something that’s not UTF-8 as if it were UTF-8, it’ll come across octet sequences that aren’t valid in UTF-8 and have to substitute the U+FFFD replacement character (�) as the best it can do.

WebClient does pretty much this for you.
DownloadString is a higher-level method. Where WebRequest and its derived classes let you work at a lower level, it gives you a single call for “send a GET request to the URI, examine the headers to see what content-encoding is in use in case you need to un-gzip or otherwise decompress it, see what character encoding is in place, set up a text reader with that encoding and the stream, and then call ReadToEnd()”. The usual pros and cons of high-level, big-chunk instructions vs low-level, small-chunk instructions apply.
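If you want to stay with WebRequest, something along these lines should work. It is only a minimal sketch: the class name CharsetAwareDownload and the helpers ResolveEncoding and DownloadHtml are made up for illustration, it assumes endpoint is an http or https URI (so the casts to HttpWebRequest and HttpWebResponse succeed), and the fallback to UTF-8 for a missing or unrecognised charset is my own choice rather than anything the framework mandates.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class CharsetAwareDownload
{
    // Hypothetical helper: turn the charset reported by the server into an
    // Encoding, falling back to UTF-8 (an assumption) when the name is
    // missing or not recognised on this platform.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
            return Encoding.UTF8;

        try
        {
            // Some servers quote the value, e.g. charset="iso-8859-1".
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported charset name.
            return Encoding.UTF8;
        }
    }

    // Hypothetical helper: same flow as the WebRequest code in the question,
    // but the StreamReader is opened with the encoding the response declares.
    public static string DownloadHtml(string endpoint)
    {
        var request = (HttpWebRequest)WebRequest.Create(endpoint);
        using (var response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream, ResolveEncoding(response.CharacterSet)))
        {
            return reader.ReadToEnd();
        }
    }
}
```

You would then call string html = CharsetAwareDownload.DownloadHtml(endpoint); in place of the original block. Note that this trusts the Content-Type header only; a page that declares its encoding solely in a meta tag would still need extra handling.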