I built a tool that takes arbitrary HTML, collects all the classes and ids and outputs them back into the page. I am concerned about security. I had been using HTML Purifier to filter the input, but I need to support HTML5, which HTML Purifier does not.
This is the gist of the tool:
$html = $_POST['html'];
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$elements = $xpath->query("//body");
foreach ($elements as $element) {
    $nodes = $element->childNodes;
    $output = write_selectors($nodes);
}
function write_selectors($nodes) {
    foreach ($nodes as $node) {
        $node->getAttribute('id');
        .
        .
        .
        $node->getAttribute('class');
        .
        .
        .
    }
    .
    .
    .
    return 'string containing all classes and ids in the document';
}
.
.
.
echo htmlentities($output, ENT_QUOTES);
My questions are:
It seems like it would be possible for someone to put a string like this into the tool: '<div '); do_bad_stuff( 'ha_ha_ha' so that $doc->loadHTML($html); would end up being executed as $doc->loadHTML('<div '); do_bad_stuff( 'ha_ha_ha');
It seems like DOMDocument just errors when I try to input funny business like that, but should I be doing something to protect against malicious inputs? If not, why not?
Secondarily, is htmlentities enough to sanitize the output?
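No, it will never do that. You have $html in a variable and pass that variable directly to the function. PHP hands the string to loadHTML as data; its contents are not spliced into your source code and re-parsed, so nothing inside it can run as PHP (that would only happen if you passed it to something like eval).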
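To make this concrete, here is a minimal sketch that feeds the exact string from the question to loadHTML. The do_bad_stuff function and the $executed flag are hypothetical, added only so the script can show whether the input was ever run as code; the libxml_use_internal_errors call is just there to keep parser warnings about the malformed markup out of the output.

```php
<?php
// Hypothetical marker: flips to true only if do_bad_stuff() is ever called.
$executed = false;

function do_bad_stuff(string $arg): void
{
    global $executed;
    $executed = true;
}

// The hostile string from the question, as it would arrive in $_POST['html'].
$html = "<div '); do_bad_stuff( 'ha_ha_ha";

// Collect libxml warnings about the malformed markup instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html); // the string is parsed as HTML, never evaluated as PHP
libxml_clear_errors();

var_dump($executed); // bool(false)
```

The quote and parenthesis characters in the input are only characters inside a string value, so they cannot close the loadHTML call the way they could if the text were pasted into the source file.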
Yes. Personally I would use htmlspecialchars, but htmlentities is fine to protect you from XSS when the output is echoed into the HTML of the page, as it is here.
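As a rough illustration of the output step, the sketch below escapes a hostile value with htmlspecialchars. The $output string is a made-up example of what a malicious class attribute might contain, and passing 'UTF-8' explicitly is an assumption about the page encoding.

```php
<?php
// Hypothetical collected selector text containing markup.
$output = '"><script>alert(1)</script>';

// Escape the characters that are special in HTML, including both quote styles.
$safe = htmlspecialchars($output, ENT_QUOTES, 'UTF-8');

echo $safe, "\n";
// &quot;&gt;&lt;script&gt;alert(1)&lt;/script&gt;
```

For plain ASCII input like this, htmlentities with the same flags should produce the same string; the two functions differ only for characters that have named entities beyond the five HTML-special ones.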