I’ve been struggling to convert the text value of an HTML node, without any success.
Here is the HTML I am trying to convert (the charset problem may not show up here, but I see it exactly as you see it):
<a href="https://sistemas.usp.br/jupiterweb/listarGradeCurricular?codcg=12&codcur=12012&codhab=1&tipo=N" target="_blank">Administração – São Paulo – diurno</a>
All right, the value of this HTML node is “Administração – São Paulo – diurno”.
I am using HtmlAgilityPack to parse the page, and once I reach this node, its InnerText value looks like this: Administração â São Paulo â diurno
I am assuming the original charset of the page is UTF-8, because that’s what the encoding tag in the HTML says.
How can I convert this garbled string to: Administração - São Paulo - Diurno?
I’ve already tried these threads: thread one and thread two, and neither solved my issue.
EDIT: I am getting the page via a C# WebRequest Get.
EDIT2 : Added HtmlAgilityPack tag
The problem has been isolated: WebRequest sometimes garbles the HTML.
Is there any other way to set the encoding? I am trying: _webReq.Encoding = "ISO-8859-1"
Thanks in advance.
A small test shows that the string is not being decoded back to its original form.
Sample test:
This prints:
As you can see, the original string is converted to bytes using UTF8, but those bytes are then converted back to a string using the Default encoding.
This is wrong: the bytes have to be decoded with the same encoding that produced them.
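For anyone who wants to reproduce the effect, here is a minimal sketch of that round trip. It is not the original test code: the class and method names are made up for illustration, and ISO-8859-1 stands in for "the wrong single-byte encoding" as an assumption (on your machine the culprit may be a different code page, such as Windows-1252).

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class MojibakeDemo
{
    // Assumption: ISO-8859-1 plays the role of the wrong encoding.
    // It maps every byte 0-255 to exactly one character, so nothing is lost on the way.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Reproduces the bug: UTF-8 bytes read back with a single-byte encoding.
    public static string Garble(string original)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);
        return Latin1.GetString(utf8Bytes);
    }

    // Undoes it: recover the original bytes, then decode them as UTF-8.
    public static string Repair(string garbled)
    {
        byte[] originalBytes = Latin1.GetBytes(garbled);
        return Encoding.UTF8.GetString(originalBytes);
    }
}
```

With this, MojibakeDemo.Garble("São") returns "SÃ£o", and MojibakeDemo.Repair("SÃ£o") returns "São" again. The UTF-8 bytes of the en dash (E2 80 93) come out in ISO-8859-1 as â followed by two invisible control characters, which would be consistent with the stray â in the question. Repairing after the fact only works while the wrong decoding step was lossless, so decoding correctly in the first place is the safer fix.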
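If the text is already wrong by the time you read it from WebRequest.GetResponse(), then the problem is in how the response bytes are being decoded. Note that the TransferEncoding property on HttpWebRequest only sets the HTTP Transfer-Encoding header (for example, chunked); it does not control the character set.
What you can do is set the Encoding to UTF8 on the StreamReader you open over the response stream. Can I see your code?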
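Something along these lines is what I mean by setting the encoding on the StreamReader. This is only a sketch, since I haven't seen your code: ResponseText and ReadAll are hypothetical names, and url in the commented call site is a placeholder.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper, for illustration only.
public static class ResponseText
{
    // Decodes a raw byte stream with an explicitly chosen encoding
    // instead of relying on whatever default the caller would otherwise get.
    public static string ReadAll(Stream body, Encoding encoding)
    {
        using (var reader = new StreamReader(body, encoding))
        {
            return reader.ReadToEnd();
        }
    }
}

// Hypothetical call site (url is a placeholder, needs using System.Net):
// var request = (HttpWebRequest)WebRequest.Create(url);
// using (var response = request.GetResponse())
// using (var stream = response.GetResponseStream())
// {
//     string html = ResponseText.ReadAll(stream, Encoding.UTF8);
//     // hand html to HtmlAgilityPack, e.g. via HtmlDocument.LoadHtml(html)
// }
```

If the page really is UTF-8, the string you get this way should already contain the correct characters before HtmlAgilityPack ever sees it.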