Since PHP 5.4, html_entity_decode has four new flags, each with only a minimal explanation:
ENT_HTML401 Handle code as HTML 4.01.
ENT_XML1 Handle code as XML 1.
ENT_XHTML Handle code as XHTML.
ENT_HTML5 Handle code as HTML 5.
I want to understand what they are for. In which cases are they significant?
My guess (but I may be wrong) is that each standard encodes some unusual characters that the others don't, and the flags exist to respect that.
My research: htmlentities has the same minimal explanation, also without examples. I have googled with no luck.
I started wondering what behavior these constants have when I saw these constants at the htmlspecialchars page. The documentation was rubbish, so I started digging in the source code of PHP.
Basically, these constants affect whether certain entities are encoded or not (or decoded for html_entity_decode). The most obvious effect is whether the apostrophe (') is encoded to &#039; (for ENT_HTML401) or &apos; (for others). Similarly, it determines whether &apos; is decoded or not when using html_entity_decode. (&#039; is always decoded.)

All usages can be found in ext/standard/html.c and its header file. From ext/standard/html.h:
ENT_HTML_DOC_HTML401 is 0, ENT_HTML_DOC_XML1 is 16, ENT_HTML_DOC_XHTML is 32 and ENT_HTML_DOC_HTML5 is (16|32) (replace ENT_HTML_DOC_ by ENT_ to get their PHP constant names).

I started looking for all occurrences of these constants, and can share the following on the behaviour of the ENT_* constants:

- Some numeric entities get decoded to an unreadable/invalid character for ENT_HTML401, ENT_XHTML and ENT_XML1. For ENT_HTML5, however, such a code point is considered invalid and hence the entity stays as it is. (C function unicode_cp_is_allowed)
- With ENT_SUBSTITUTE enabled, invalid code unit sequences for a specified character set are replaced with �. (does not depend on document type!)
- With ENT_DISALLOWED enabled, code points that are disallowed for the specified document type are replaced with �. (does not depend on charset!)
- With ENT_IGNORE, the same invalid code unit sequences from ENT_SUBSTITUTE are removed and no replacement is done (depends on choice of "document type", e.g. ENT_HTML5)
- [...] for ENT_HTML5 (line 976)
- ENT_XHTML shares the entity map with ENT_HTML401. The only difference is that &apos; will be converted to an apostrophe with ENT_XHTML while ENT_HTML401 does not convert it (see this line)
- ENT_HTML401 and ENT_XHTML use exactly the same entity map (minus the difference from the previous point). ENT_HTML5 uses its own map. Others (currently ENT_XML1) have a very limited decoding map (&gt;, &amp;, &lt;, &apos;, &quot; and their numeric equivalents). (see C function unescape_inverse_map)
- When encoding only the special characters (htmlspecialchars), every document type uses the same entity map as ENT_XML1, except for ENT_HTML401. That one will not use &apos;, but &#039;.

That covers almost everything. I am not going to list all entity differences; instead I would like to point at https://github.com/php/php-src/tree/php-5.4.11/ext/standard/html_tables for some text files that contain the mappings for each type.
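The apostrophe behaviour described above can be checked with a few calls. This is only a minimal sketch, assuming PHP 5.4 or later and UTF-8 input; the array names $encoded and $decoded are made up for illustration.

```php
<?php
// Encoding: the entity chosen for the apostrophe depends on the document type.
// ENT_QUOTES is needed, otherwise the apostrophe is not encoded at all.
$encoded = [
    'ENT_HTML401' => htmlspecialchars("'", ENT_QUOTES | ENT_HTML401, 'UTF-8'),
    'ENT_XML1'    => htmlspecialchars("'", ENT_QUOTES | ENT_XML1, 'UTF-8'),
    'ENT_XHTML'   => htmlspecialchars("'", ENT_QUOTES | ENT_XHTML, 'UTF-8'),
    'ENT_HTML5'   => htmlspecialchars("'", ENT_QUOTES | ENT_HTML5, 'UTF-8'),
];

// Decoding: &apos; is left alone for ENT_HTML401, while &#039; is decoded
// for every document type (as long as ENT_QUOTES is set).
$decoded = [
    '&apos; / ENT_HTML401' => html_entity_decode('&apos;', ENT_QUOTES | ENT_HTML401, 'UTF-8'),
    '&apos; / ENT_XML1'    => html_entity_decode('&apos;', ENT_QUOTES | ENT_XML1, 'UTF-8'),
    '&apos; / ENT_XHTML'   => html_entity_decode('&apos;', ENT_QUOTES | ENT_XHTML, 'UTF-8'),
    '&apos; / ENT_HTML5'   => html_entity_decode('&apos;', ENT_QUOTES | ENT_HTML5, 'UTF-8'),
    '&#039; / ENT_HTML401' => html_entity_decode('&#039;', ENT_QUOTES | ENT_HTML401, 'UTF-8'),
];

foreach ($encoded as $flag => $result) {
    echo "encode with $flag: $result", PHP_EOL;
}
foreach ($decoded as $case => $result) {
    echo "decode $case: $result", PHP_EOL;
}
```

Passing the flags explicitly keeps the result independent of whatever defaults the running PHP version uses.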
What ENT_* should I use for htmlspecialchars?
When using htmlspecialchars with ENT_COMPAT (default) or ENT_NOQUOTES, it does not matter which one you pick (see below). I saw some answers here on SO that boil down to passing ENT_HTML5 alone as the flags argument of htmlspecialchars.

This is insecure. It overrides the default value ENT_HTML401 | ENT_COMPAT, and the difference is not only that HTML5 entities are used, but also that quotes are not escaped anymore! In addition, this is redundant code. The entities that have to be encoded by htmlspecialchars are the same for all of ENT_HTML401, ENT_HTML5, etc.

Just use ENT_COMPAT or ENT_QUOTES instead. The latter also works when you use apostrophes for attributes (value='foo'). If you only have two arguments for htmlspecialchars, do not include the argument at all since it is the default (ENT_HTML401 is 0, remember?).

When you want to print something on the page (between tags, not attributes), it does not matter at all which one you pick as it will have equal effect. It is even sufficient to use ENT_NOQUOTES | ENT_HTML401, which equals the numeric value 0.

See also below, about ENT_SUBSTITUTE and ENT_DISALLOWED.
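To make the quoting problem concrete, here is a small sketch. The attribute payload and the variable names ($payload, $unsafe, $safe) are hypothetical, and UTF-8 input is assumed.

```php
<?php
// A value that tries to break out of a double-quoted HTML attribute.
$payload = '" onmouseover="alert(1)';

// ENT_HTML5 on its own carries no quote flag, so it behaves like
// ENT_NOQUOTES | ENT_HTML5 and the double quotes are left untouched.
$unsafe = htmlspecialchars($payload, ENT_HTML5, 'UTF-8');

// ENT_QUOTES escapes both double and single quotes.
$safe = htmlspecialchars($payload, ENT_QUOTES, 'UTF-8');

echo 'unsafe: <input value="' . $unsafe . '">', PHP_EOL;
echo 'safe:   <input value="' . $safe . '">', PHP_EOL;

// The numeric values behind the flags.
var_dump(ENT_HTML401, ENT_NOQUOTES | ENT_HTML401, ENT_COMPAT, ENT_QUOTES, ENT_HTML5);
```

If a document type is wanted as well, it has to be OR-ed with a quote flag, for example ENT_QUOTES | ENT_HTML5.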
What ENT_* should I use for htmlentities?
If your text editor or database is so crappy that you cannot include non-US-ASCII characters (e.g. UTF-8), you can use htmlentities. Otherwise, save some bytes and use htmlspecialchars instead (see above).
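As a rough illustration of that trade-off, the sketch below runs both functions on the same UTF-8 string. The sample text and the variable names are made up for the example.

```php
<?php
// "café <b>" written with explicit UTF-8 bytes so the file encoding does not matter.
$text = "caf\xC3\xA9 <b>";

// htmlentities turns every character that has a named entity into that entity.
$allEntities = htmlentities($text, ENT_COMPAT | ENT_HTML401, 'UTF-8');

// htmlspecialchars only touches the characters that are special in HTML.
$specialOnly = htmlspecialchars($text, ENT_COMPAT | ENT_HTML401, 'UTF-8');

echo $allEntities, PHP_EOL;
echo $specialOnly, PHP_EOL;
```

The htmlentities output is pure ASCII but longer; the htmlspecialchars output keeps the accented character as raw UTF-8 bytes.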
Whether you need to use ENT_HTML401, ENT_HTML5 or something else depends on how your page is served. When you have a HTML5 page (<!doctype html>), use ENT_HTML5. XHTML or XML? Use the corresponding ENT_XHTML or ENT_XML1. With no doctype or plain ol' HTML4, use ENT_HTML401 (which is the default when omitted).

Should I use ENT_DISALLOWED, ENT_IGNORE or ENT_SUBSTITUTE?
By default, input that contains a byte sequence that is invalid for the given character set is not escaped at all: an empty string is returned. To have a � in place of an invalid byte sequence, specify ENT_SUBSTITUTE. (Note that &#xFFFD; is shown for non-UTF-8 charsets.) When you specify ENT_IGNORE though, these byte sequences are silently dropped, even if you specified ENT_SUBSTITUTE as well.

Invalid characters for a document type are substituted by the same replacement character (or its entity) above when ENT_DISALLOWED is specified. This happens regardless of having ENT_IGNORE set (which has nothing to do with invalid chars for doctypes).
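The flags can be compared side by side with a deliberately broken string. This is a sketch under the assumption of PHP 5.4 or later with 'UTF-8' as the charset. The sample bytes and variable names are invented for the example, and the ENT_DISALLOWED result is printed as hex rather than asserted.

```php
<?php
// "\x80" on its own is not a valid UTF-8 sequence.
$bad = "a\x80b";

// No error-handling flag: the whole result is an empty string.
$default = htmlspecialchars($bad, ENT_COMPAT, 'UTF-8');

// ENT_SUBSTITUTE: the invalid byte becomes U+FFFD (bytes EF BF BD in UTF-8).
$substituted = htmlspecialchars($bad, ENT_COMPAT | ENT_SUBSTITUTE, 'UTF-8');

// ENT_IGNORE: the invalid byte is silently dropped.
$ignored = htmlspecialchars($bad, ENT_COMPAT | ENT_IGNORE, 'UTF-8');

// ENT_DISALLOWED concerns valid code points that the document type does not
// allow; U+0001 is used here as a sample control character.
$disallowed = htmlspecialchars("a\x01b", ENT_COMPAT | ENT_DISALLOWED | ENT_HTML401, 'UTF-8');

foreach (compact('default', 'substituted', 'ignored', 'disallowed') as $name => $value) {
    echo $name, ': ', bin2hex($value), PHP_EOL;
}
```

Dropping bytes with ENT_IGNORE changes the text without leaving a trace, so ENT_SUBSTITUTE is usually the easier of the two to debug.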