It may seem like a subjective question, but all I’m looking for are some hard-and-fast rules for when to use and when not to use HTML character references, particularly given the charset:
<meta http-equiv="content-Type" content="text/html; charset=utf-8" />
I’m picking up development on a company website from where someone else left off, and it seems the previous developer encoded everything other than A-Z and 0-9 as HTML character references. For example, every comma has been encoded as &#44;, and I’m not sure whether this is a good thing.
Specifically, is the following bad in terms of SEO?
<meta name='keywords' content='eriks industrial services&#44; industrial products&#44; industrial services&#44; eriks&#44; uk&#44; european&#44; leader&#44; european leader&#44; eriks&#44; power transmission&#44; power&#44; bearings'/>
And specifically what characters must always be encoded as character references?
And for the sake of consistency is it better to avoid &name; and use &#DD; wherever possible?
Character references should be used when the software that creates or edits the document, the data storage, or a transport channel cannot store Unicode data or cannot preserve the byte stream it is encoded to.
In practice this means working with legacy applications, legacy configuration, or legacy transport protocols. In such cases some part of the toolchain may support only 8-bit encodings, or even only ASCII. Unicode characters cannot be stored as such there, so falling back to character references for everything outside ASCII can be useful: it avoids the nasty encoding conversion problems that can appear when switching from 8-bit encodings to Unicode. Using named entities instead of numeric character references is marginally more readable, but it unnecessarily complicates XML compatibility or a migration to XML, because named entities require a DOCTYPE declaration or an embedded DTD. This does not apply to &lt;, &amp;, &quot;, &apos; and &gt;, which are predefined in XML.
If you are working in a modern environment, using Unicode characters as such is generally preferred: the text can often be used without parsing or interpretation (e.g. direct searches in the text), it is easier, and it will probably lead to more readable and thus more maintainable code.
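To make the legacy-toolchain fallback and the XML caveat above concrete, here is a minimal Python sketch using only the standard library. The helper names (to_ascii_with_char_refs, parses_as_xml) and the sample string are made up for illustration; this is one way to do it, not the only one.

```python
import html
import xml.etree.ElementTree as ET


def to_ascii_with_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference.

    Hypothetical helper: it relies on the built-in "xmlcharrefreplace"
    error handler, which leaves ASCII characters untouched.
    """
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


def parses_as_xml(fragment: str) -> bool:
    """Return True if the fragment is well-formed XML without any DTD."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


sample = "Café Zoë, 10 €"
ascii_safe = to_ascii_with_char_refs(sample)
print(ascii_safe)  # Caf&#233; Zo&#235;, 10 &#8364;

# A parser that understands character references recovers the same text.
print(html.unescape(ascii_safe) == sample)  # True

# Numeric references and the five predefined entities need no DTD in XML...
print(parses_as_xml("<p>Caf&#233;</p>"))         # True
print(parses_as_xml("<p>a &lt; b &amp; c</p>"))  # True
# ...but an HTML-only named entity is undefined without one.
print(parses_as_xml("<p>Caf&eacute;</p>"))       # False
```

Note that the comma in the sample stays a plain comma: it is already ASCII, so there is nothing to gain by writing it as a reference.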
The characters you must encode are < and &, and also " and ' when they appear in an attribute value and the same character is used as the attribute value delimiter. In theory you should also escape > when it appears as part of a ]]> string that is not meant to end a CDATA section, but this is only for SGML compatibility and therefore not generally needed. These characters should be escaped using entities instead of character references. The need to escape & also applies to URL values in <a href="...">, which unfortunately is commonly forgotten.
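As a rough illustration of those escaping rules, the sketch below uses Python's standard html.escape. The wrapper names (escape_text, escape_attr) and the example URL are assumptions made for the example. One caveat: html.escape writes the apostrophe as the numeric reference &#x27; rather than &apos;, and it always escapes > as well, which is harmless.

```python
import html


def escape_text(text: str) -> str:
    """Escape &, < and > for use in element content (hypothetical wrapper)."""
    return html.escape(text, quote=False)


def escape_attr(value: str) -> str:
    """Escape &, <, > and both quote characters for use in an attribute value."""
    return html.escape(value, quote=True)


# The raw URL contains a bare "&", which must be escaped inside href.
url = "search.php?q=bearings&page=2"
label = "Bearings & belts < 50 mm"

link = f'<a href="{escape_attr(url)}">{escape_text(label)}</a>'
print(link)
# <a href="search.php?q=bearings&amp;page=2">Bearings &amp; belts &lt; 50 mm</a>

# Quotes only need escaping where they could end the attribute value.
print(escape_attr('say "hi"'))  # say &quot;hi&quot;
print(escape_text('say "hi"'))  # say "hi"
```

The browser decodes &amp; back to & before following the link, so the server still receives q=bearings&page=2.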