I’m working with XML data from an application where we get XML like this:
<elt attrib="Swedish: &#228; &#246; Euro: &#128; Quotes: &#145; &#146; &#147; &#148;">
Swedish: &#228; &#246; Euro: &#128; Quotes: &#145; &#146; &#147; &#148;
</elt>
I want the attribute value and inner text values to be
Swedish: ä ö Euro: € Quotes: ‘ ’ “ ”
but code like this:
Dim sXml As String = "<?xml version = ""1.0"" encoding = ""Windows-1252""?>" & vbCrLf & _
"<elt attrib=""Swedish: &#228; &#246; Euro: &#128; Quotes: &#145; &#146; &#147; &#148;"">" & _
"Swedish: &#228; &#246; Euro: &#128; Quotes: &#145; &#146; &#147; &#148;" & _
"</elt>"
Dim X As New XmlDocument
X.LoadXml(sXml)
TextBox1.Text = "Attribute: {" & X.DocumentElement.Attributes("attrib").Value & "}" & _
vbCrLf & "InnerText: {" & X.DocumentElement.InnerText & "}" & vbCrLf & _
"Length: " & Convert.ToString(Len(X.DocumentElement.InnerText))
or this:
Dim X As XDocument = XDocument.Parse(sXml)
TextBox1.Text = "Attribute: {" & X.Root.Attribute("attrib").Value & "}" & _
vbCrLf & "InnerText: {" & X.Root.Value & "}" & vbCrLf & _
"Length: " & Convert.ToString(Len(X.Root.Value))
give me:
{Swedish: ä ö Euro: Quotes: }
Both report the correct length of 36, so apparently where I want the Euro sign and the quotes I’m getting some other characters, presumably based on a Unicode encoding.
First of all, numeric character entities are interpreted the same way regardless of the encoding of the input file. XML is defined strictly in terms of Unicode (any other encoding is mapped onto Unicode first), and numeric character entities represent Unicode codepoints.
Because of that, your XML, when treated as XML, has precisely the semantic meaning that you’ve got out of it using XmlDocument, and no other. If you want to get another result, then you are really trying to parse it as not-quite-XML, which is something no .NET XML API will let you do, not even XmlReader (because it really isn’t supposed to be something that you can customize).
The closest you can come to that is to first preprocess the input “XML” as text, replacing those numeric character entities with the correct Unicode codepoints, for example using Regex. This can be tricky, however, because doing so for arbitrary input XML requires you to distinguish the places where the expansion should not take place (e.g. inside CDATA blocks).
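A minimal sketch of that preprocessing step, under some loudly stated assumptions: the module and function names (Cp1252EntityFixer, FixCp1252CharRefs) are made up for illustration, the producing application is assumed to have written Windows-1252 byte values into references in the range 128 to 159, and the input is assumed to contain no CDATA sections, comments or processing instructions in which such a sequence should stay literal. Rather than inserting the literal character, it rewrites each affected reference to point at the code point Windows-1252 assigns to that byte, so the text stays pure ASCII and the encoding declaration does not matter.

```vbnet
Imports System
Imports System.Globalization
Imports System.Text
Imports System.Text.RegularExpressions
Imports System.Xml

Module Cp1252EntityFixer
    ' Matches decimal (&#128;) and hexadecimal (&#x80;) character references.
    Private ReadOnly CharRef As New Regex("&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

    ' Hypothetical helper: rewrites every reference in the range 128-159.
    ' Assumption: the text has no CDATA sections or comments to skip.
    Function FixCp1252CharRefs(ByVal xmlText As String) As String
        Return CharRef.Replace(xmlText, AddressOf FixOne)
    End Function

    Private Function FixOne(ByVal m As Match) As String
        Dim value As Integer
        Dim ok As Boolean
        If m.Groups(1).Success Then
            ok = Integer.TryParse(m.Groups(1).Value, NumberStyles.AllowHexSpecifier,
                                  CultureInfo.InvariantCulture, value)
        Else
            ok = Integer.TryParse(m.Groups(2).Value, NumberStyles.None,
                                  CultureInfo.InvariantCulture, value)
        End If

        ' Only 128-159 differ between Windows-1252 and Unicode; XML reads
        ' them as C1 control characters. Leave everything else untouched.
        If Not ok OrElse value < 128 OrElse value > 159 Then Return m.Value

        ' On .NET Core / .NET 5+, code page 1252 must be registered first:
        ' Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)
        Dim cp1252 As Encoding = Encoding.GetEncoding(1252)
        Dim ch As String = cp1252.GetString(New Byte() {CByte(value)})

        ' Emit a reference to the real Unicode code point, e.g. &#8364; for the Euro sign.
        Return "&#" & AscW(ch).ToString(CultureInfo.InvariantCulture) & ";"
    End Function

    Sub Main()
        Dim raw As String = "<elt attrib=""Euro: &#128;"">Quotes: &#145; &#146; &#147; &#148;</elt>"
        Dim fixedXml As String = FixCp1252CharRefs(raw)

        ' fixedXml is now:
        ' <elt attrib="Euro: &#8364;">Quotes: &#8216; &#8217; &#8220; &#8221;</elt>
        Console.WriteLine(fixedXml)

        ' The standard parser then yields the intended characters.
        Dim doc As New XmlDocument()
        doc.LoadXml(fixedXml)
        Console.WriteLine(Len(doc.DocumentElement.InnerText))
    End Sub
End Module
```

References outside that range, such as &#228; for ä, already mean the same thing in Windows-1252 and Unicode, so the sketch passes them through unchanged. To handle CDATA sections correctly, the pattern would have to be extended to recognise and skip them, which is where the trickiness mentioned above comes in.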