I’m building a crawler for users to post links and get a preview of the contents of the page, and I can’t figure out why sometimes I get � when requesting a particular resource, even though Facebook seems to crawl it properly. I must be missing something.
I’m using HtmlAgilityPack to help me parse the HTML, and a default WebClient to help with making the actual requests. Here’s the relevant code:
using (ExtendedWebClient client = new ExtendedWebClient())
{
    using (Stream stream = client.OpenRead(endpoint))
    {
        if (stream != null)
        {
            Encoding encoding = GetHttpResponseEncoding(client.ResponseHeaders);
            HtmlDocument document = new HtmlDocument();
            document.Load(stream, encoding);
            return document.DeEntitize();
        }
    }
}
private Encoding GetHttpResponseEncoding(WebHeaderCollection headers)
{
    Encoding encoding = Encoding.UTF8; // use UTF-8 by default.
    string contentType = headers.Get("Content-Type");
    if (contentType != null) // expected form: "text/html; charset=utf-8".
    {
        string[] keyValuePairs = contentType.Split(';');
        foreach (string[] kvp in keyValuePairs.Select(pair => pair.Split('=')))
        {
            if (kvp.Length == 2 && kvp[0].Trim().ToLowerInvariant() == "charset")
            {
                try
                {
                    // use the response header encoding; strip whitespace and optional quotes.
                    return Encoding.GetEncoding(kvp[1].Trim().Trim('"'));
                }
                catch (ArgumentException)
                {
                    break; // unknown charset name: fall back to the default.
                }
            }
        }
    }
    return encoding;
}
public static HtmlDocument DeEntitize(this HtmlDocument document)
{
    string html = HtmlEntity.DeEntitize(document.DocumentNode.OuterHtml);
    HtmlDocument decoded = new HtmlDocument();
    decoded.LoadHtml(html);
    return decoded;
}
The ExtendedWebClient just extends System.Net.WebClient by adding a UserAgent header impersonating a Firefox browser request.
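For reference, a class of that kind might look roughly like the sketch below. This is a hypothetical reconstruction, not the actual class: the exact User-Agent string is an assumption, and overriding GetWebRequest is just one common way to set the header.

```csharp
using System;
using System.Net;

// Hypothetical reconstruction of the class described above.
public class ExtendedWebClient : WebClient
{
    // Assumed value: any Firefox-like User-Agent string would do.
    public const string FirefoxUserAgent =
        "Mozilla/5.0 (Windows NT 6.1; rv:14.0) Gecko/20100101 Firefox/14.0";

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);

        // Only HTTP(S) requests carry a User-Agent header.
        HttpWebRequest http = request as HttpWebRequest;
        if (http != null)
        {
            http.UserAgent = FirefoxUserAgent;
        }
        return request;
    }
}
```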
The test code invokes the first piece of code with the following endpoint parameter:
new Uri("http://www.cronica.com.ar/diario/2012/07/30/30541-delpo-quiere-meterse-en-la-tercera-ronda.html")
Here’s a small snippet from the page:
Juan Mart�n Del Potro, que viene de vencer c�modamente al croata Ivan Dodig
Even when opening that link in a browser window (and looking at the source), I do get those enraging �.
The thing that is driving me nuts is that Facebook is able to read this properly. So what’s the issue here? Are they declaring their encoding as UTF-8 but not actually complying with that standard, or what am I missing?
Note that with this code I’m able to correctly parse pages like Facebook’s Spanish home page, which contains characters like ñ that usually cause trouble when there is an encoding problem, so this must be something else.
I think your parser is working fine. It’s just that the page either A) uses a mixed or incorrectly declared encoding, or B) actually contains the Unicode replacement character ‘�’, i.e. the characters got munged somewhere before being written to the page (for example, on the way in or out of the database). Where accents do show up correctly, the page is using HTML entities, not the characters themselves.
If A), you could try to detect the encoding yourself (painful and error-prone).
If B), you can’t do anything.
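To tell the two cases apart, you can look at the raw bytes instead of the decoded text. The sketch below is only an illustration under a few assumptions: EncodingProbe and its method names are made up for this answer, and ISO-8859-1 is just a guess at the fallback encoding. It checks whether the response already contains the UTF-8 bytes of U+FFFD (case B), and otherwise tries strict UTF-8 before falling back (a crude take on case A).

```csharp
using System;
using System.Text;

// Hypothetical helper; the names are illustrative, not part of any library.
public static class EncodingProbe
{
    // Case B check: true if the raw body already contains EF BF BD,
    // the UTF-8 encoding of U+FFFD (the replacement character).
    public static bool ContainsEncodedReplacementChar(byte[] body)
    {
        for (int i = 0; i + 2 < body.Length; i++)
        {
            if (body[i] == 0xEF && body[i + 1] == 0xBF && body[i + 2] == 0xBD)
            {
                return true;
            }
        }
        return false;
    }

    // Case A attempt: decode as strict UTF-8; if the bytes are not valid
    // UTF-8, fall back to ISO-8859-1 (an assumed fallback, not a detection).
    public static string DecodeWithFallback(byte[] body)
    {
        try
        {
            // throwOnInvalidBytes: true makes invalid sequences throw
            // instead of silently turning into U+FFFD.
            UTF8Encoding strictUtf8 = new UTF8Encoding(false, true);
            return strictUtf8.GetString(body);
        }
        catch (DecoderFallbackException)
        {
            return Encoding.GetEncoding("iso-8859-1").GetString(body);
        }
    }
}
```

With the code from the question, you could fetch the bytes with client.DownloadData(endpoint) and pass the decoded string to document.LoadHtml(...). If ContainsEncodedReplacementChar returns true, the server itself is sending ‘�’ and no decoding trick will bring the original characters back.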