I am reading the documentation for creating a podcast feed suitable for iTunes, and the Common Mistakes section says:
Using HTML Named Character Entities.
<!-- illegal xml -->
<copyright>&copy; 2005 John Doe</copyright>
<!-- valid xml -->
<copyright>&#xA9; 2005 John Doe</copyright>
Unlike HTML, XML supports only five
“named character entities”:
character  name               xml
&          ampersand          &amp;
<          less-than sign     &lt;
>          greater-than sign  &gt;
'          apostrophe         &apos;
"          quotation          &quot;
The five characters above are the only
characters that require escaping in
XML. All other characters can be
entered directly in an editor that
supports UTF-8. You can also use
numeric character references that
specify the Unicode for the character,
for example:
character  name                      xml
©          copyright sign            &#xA9;
℗          sound recording copyright &#x2117;
™          trade mark sign           &#x2122;
For further reference see XML
Character and Entity References.
Right now I’m using htmlentities() under PHP5 and the feed is validating and working. But from what I gather some things that could get put into content might become entities that would make it no longer be valid. What’s the best function to use to assure I’m not passing along bad data? I’m paranoid something will get entered and get entity-ized and break the feed — should I just use str_replace() and replace with named entities and leave the rest alone? Or can I use htmlspecialchars() somehow?
So in short, what’s a drop-in replacement for htmlentities() that will make sure input is safe for descriptions, titles, etc. in a podcast RSS feed?
You can either:
- Wrap the content in a CDATA section. The one exception is the sequence ]]>, which cannot be put literally in a CDATA block.
- Use mb_encode_numericentity instead of htmlentities (possibly combined with htmlspecialchars and a previous decoding of HTML entities with mb_convert_encoding).
If the encoding of the XML file is UTF-8, you can just remove the entities. Suppose you have the following HTML fragment:
Then, you could just do:
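A minimal sketch of that idea, assuming a UTF-8 feed and PHP 5.4 or later (ENT_XML1 does not exist before that). The helper names xml_safe_text() and xml_cdata() and the sample fragment are made up for illustration. The first helper decodes named HTML entities into real UTF-8 characters and then escapes only the five characters XML reserves; the second shows the CDATA option, splitting any ]]> across two sections.

```php
<?php
// Hypothetical helper (name is illustrative): make a string safe as XML text.
function xml_safe_text($html)
{
    // Turn named HTML entities (&copy;, &eacute;, ...) into real UTF-8 characters.
    $decoded = html_entity_decode($html, ENT_QUOTES, 'UTF-8');
    // Escape only & < > ' " using the XML predefined entities (PHP 5.4+).
    return htmlspecialchars($decoded, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

// Hypothetical helper: wrap text in CDATA, splitting "]]>" so it never
// appears literally inside a single section.
function xml_cdata($text)
{
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
}

$fragment = '&copy; 2005 John Doe &amp; Sons'; // illustrative input
echo '<copyright>' . xml_safe_text($fragment) . '</copyright>', "\n";
// prints: <copyright>© 2005 John Doe &amp; Sons</copyright>
```

On PHP versions older than 5.4, dropping ENT_XML1 should still give valid XML, since htmlspecialchars() then emits &#039; for the apostrophe, which is a numeric character reference.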