My RESTful WCF service accepts XML request bodies from clients; most of the clients are PHP applications.
The PHP applications escape their values with htmlentities() and place the result inside the element tags. For example, a request to add a new user account might look like this:
$body = "<user>
<userName>" . htmlentities( $userName ) . "</userName>
</user>";
The system has worked fine, with zero errors, until today.
I looked through the logs and saw this request had failed:
<user>
<userName>&egrave;eesu</userName>
</user>
with the following exceptions:
InvalidOperationException: “There is an error in XML document (4, 12).”
XmlException: “Character reference not valid. Line 4, position 12.”
(where line 4, position 12 refers to the <userName> element’s InnerText, i.e. the string &egrave;eesu).
&egrave; is a valid HTML entity, but I understand that XML only predefines a minimal set of entities (&amp;, &lt;, etc.) and expects all other characters to appear in the document's encoding, so it will reject things like &egrave;.
Can someone confirm this is the case? And if so, how can I get PHP to only encode XML-specific entities instead of HTML entities?
I use htmlspecialchars( $userName, ENT_XML1 ) instead, which converts only the minimum set of characters to entities and leaves everything else unencoded. @Jordan’s str_replace function does the same thing, but when you benchmark it, it is slower because htmlspecialchars is a native function.
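As a rough sketch of the approach, the request body from the question could be built as shown here. The helper name buildUserXml is hypothetical and not from the original post, and adding ENT_QUOTES and an explicit 'UTF-8' charset is an assumption on my part (ENT_XML1 on its own leaves quotes untouched, which is fine inside element content but not inside attributes).

```php
<?php
// Hypothetical helper (not from the original post): builds the <user> body.
function buildUserXml(string $userName): string
{
    // With ENT_XML1, only &, <, > and (because of ENT_QUOTES) " and '
    // are escaped. Characters such as è are passed through unchanged,
    // so the document must be sent in the stated encoding (assumed UTF-8).
    $escaped = htmlspecialchars($userName, ENT_XML1 | ENT_QUOTES, 'UTF-8');

    return "<user>\n<userName>" . $escaped . "</userName>\n</user>";
}

// "\u{E8}" is è written as a Unicode escape (PHP 7+), so this example
// does not depend on the encoding of the source file.
echo buildUserXml("\u{E8}eesu & <co>"), "\n";
```

Here the è stays a literal character in the output, while & and the angle brackets become &amp;, &lt; and &gt;, all of which are predefined in XML.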