The Archive Base

Editorial Team
Asked: May 23, 2026

As most (all?) PHP libraries that do HTML sanitization, such as HTML Purifier, are heavily dependent on regex, I thought trying to write an HTML sanitizer that uses DOMDocument and related classes would be a worthwhile experiment. While I’m at a very early stage with this, the project so far shows some promise.

My idea revolves around a class that uses DOMDocument to traverse all nodes in the supplied markup, compare them to a whitelist, and remove anything not on the whitelist. (The first implementation is very basic, only removing nodes based on their type, but I hope to get more sophisticated and analyse the nodes’ attributes, whether links address items in a different domain, etc. in the future.)

My question is: how do I traverse the DOM tree? As I understand it, DOM* objects have a childNodes property, so would I need to recurse over the whole tree? Also, early experiments with DOMNodeLists have shown that you need to be very careful about the order in which you remove things, otherwise you might leave items behind or trigger exceptions.

If anyone has experience with manipulating a DOM tree in PHP I’d appreciate any feedback you may have on the topic.

EDIT: I’ve built the following method for my HTML cleaning class. It recursively walks the DOM tree and checks whether the found elements are on the whitelist. If they aren’t, they are removed.

The problem I was hitting was that if you delete a node, the indexes of all subsequent nodes in the DOMNodeList change. Simply working from bottom to top avoids this problem. It’s still a very basic approach currently, but I think it shows promise. It certainly works a lot faster than HTML Purifier, though admittedly Purifier does a lot more.

/**
 * Recursively remove nodes from the DOM that aren't whitelisted.
 * Nodes are matched on nodeName, so text nodes ('#text') must be on the
 * whitelist as well if their content is to be kept.
 * @param DOMNode $elem
 * @return array List of nodes removed from the DOM
 * @throws Exception If removal of a node failed then an exception is thrown
 */
private function cleanNodes (DOMNode $elem)
{
    $removed    = array ();
    if (in_array ($elem -> nodeName, $this -> whiteList))
    {
        if ($elem -> hasChildNodes ())
        {
            /*
             * Iterate over the node's children. The reason we go backwards is because
             * going forwards will cause indexes to change when nodes get removed
             */
            $children   = $elem -> childNodes;
            $index      = $children -> length;
            while (--$index >= 0)
            {
                $removed = array_merge ($removed, $this -> cleanNodes ($children -> item ($index)));
            }
        }
    }
    else
    {
        // The node is not on the whitelist, so remove it. A node without a parent
        // (e.g. the document itself) cannot be removed and ends up in the exception branch.
        if ($elem -> parentNode !== null && $elem -> parentNode -> removeChild ($elem))
        {
            $removed [] = $elem;
        }
        else
        {
            throw new Exception ('Failed to remove node from DOM');
        }
    }
    return ($removed);
}
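
The shifting indexes can also be sidestepped by copying the live DOMNodeList into a plain array before removing anything. This is only a minimal sketch, not the class method above: the standalone function `removeDisallowed`, its whitelist and the sample markup are all made up for illustration, and unlike `cleanNodes` it only judges element nodes and leaves text nodes alone.

```php
<?php
/**
 * Hypothetical helper (not part of the class above): removes every element
 * whose tag name is not in $whiteList, together with its subtree.
 * Text and other non-element nodes are left untouched.
 *
 * @return string[] Tag names of the removed elements
 */
function removeDisallowed(DOMNode $node, array $whiteList): array
{
    $removed = [];
    // iterator_to_array() copies the live DOMNodeList into a static array,
    // so removing a child does not shift the items still to be visited.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($child->nodeType !== XML_ELEMENT_NODE) {
            continue;
        }
        if (in_array($child->nodeName, $whiteList, true)) {
            // Allowed element: descend into its children.
            $removed = array_merge($removed, removeDisallowed($child, $whiteList));
        } else {
            $node->removeChild($child);
            $removed[] = $child->nodeName;
        }
    }
    return $removed;
}

// Sample markup and whitelist, chosen for illustration only.
$doc = new DOMDocument();
$doc->loadHTML('<div><p>Hello <b>world</b></p><script>alert(1)</script><iframe></iframe></div>');
$body = $doc->getElementsByTagName('body')->item(0);

$removed = removeDisallowed($body, ['div', 'p', 'b']);
$cleaned = $doc->saveHTML($body);

echo implode(', ', $removed), "\n";
echo $cleaned, "\n";
```

Walking forwards over a copy and walking backwards over the live list should both work; the copy costs a little memory but keeps the loop in document order.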
1 Answer

  1. Editorial Team
    Added an answer on May 23, 2026 at 6:12 am

    For a start, you can have a look at this custom RecursiveDOMIterator:

    • https://github.com/salathe/spl-examples/wiki/RecursiveDOMIterator

    Code:

    class RecursiveDOMIterator implements RecursiveIterator
    {
        /**
         * Current Position in DOMNodeList
         * @var Integer
         */
        protected $_position;
    
        /**
         * The DOMNodeList with all children to iterate over
         * @var DOMNodeList
         */
        protected $_nodeList;
    
        /**
         * @param DOMNode $domNode
         * @return void
         */
        public function __construct(DOMNode $domNode)
        {
            $this->_position = 0;
            $this->_nodeList = $domNode->childNodes;
        }
    
        /**
         * Returns the current DOMNode
         * @return DOMNode
         */
        public function current()
        {
            return $this->_nodeList->item($this->_position);
        }
    
        /**
         * Returns an iterator for the current iterator entry
         * @return RecursiveDOMIterator
         */
        public function getChildren()
        {
            return new self($this->current());
        }
    
        /**
         * Returns if an iterator can be created for the current entry.
         * @return Boolean
         */
        public function hasChildren()
        {
            return $this->current()->hasChildNodes();
        }
    
        /**
         * Returns the current position
         * @return Integer
         */
        public function key()
        {
            return $this->_position;
        }
    
        /**
         * Moves the current position to the next element.
         * @return void
         */
        public function next()
        {
            $this->_position++;
        }
    
        /**
         * Rewind the Iterator to the first element
         * @return void
         */
        public function rewind()
        {
            $this->_position = 0;
        }
    
        /**
         * Checks if current position is valid
         * @return Boolean
         */
        public function valid()
        {
            return $this->_position < $this->_nodeList->length;
        }
    }
    

    You can use that in combination with a RecursiveIteratorIterator. Usage examples are on the page.
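
As a rough sketch of that combination (the linked page has the original examples): the class is repeated here in condensed form only so the snippet runs on its own. The return types are my addition and assume PHP 8.0 or later, and the sample markup is made up.

```php
<?php
// Condensed copy of the class above so that this snippet is self-contained.
// The return types are an addition (assumes PHP 8.0+); they avoid the
// deprecation notices newer PHP versions raise for untyped iterator methods.
if (!class_exists('RecursiveDOMIterator')) {
    class RecursiveDOMIterator implements RecursiveIterator
    {
        protected $_position = 0;
        protected $_nodeList;

        public function __construct(DOMNode $domNode)
        {
            $this->_nodeList = $domNode->childNodes;
        }

        public function current(): mixed { return $this->_nodeList->item($this->_position); }
        public function getChildren(): ?RecursiveIterator { return new self($this->current()); }
        public function hasChildren(): bool { return $this->current()->hasChildNodes(); }
        public function key(): mixed { return $this->_position; }
        public function next(): void { $this->_position++; }
        public function rewind(): void { $this->_position = 0; }
        public function valid(): bool { return $this->_position < $this->_nodeList->length; }
    }
}

// Sample markup, chosen for illustration only.
$doc = new DOMDocument();
$doc->loadHTML('<div><p>Hello <b>world</b></p><ul><li>one</li></ul></div>');
$body = $doc->getElementsByTagName('body')->item(0);

// SELF_FIRST yields each parent before its children, i.e. document order.
$iterator = new RecursiveIteratorIterator(
    new RecursiveDOMIterator($body),
    RecursiveIteratorIterator::SELF_FIRST
);

// Collect the tag names of all element nodes below <body>.
$names = [];
foreach ($iterator as $node) {
    if ($node->nodeType === XML_ELEMENT_NODE) {
        $names[] = $node->nodeName;
    }
}

echo implode(' > ', $names), "\n";
```

This only reads the tree. Removing nodes while the iterator is running brings back the shifting-index problem, so it seems safer to collect the nodes first and remove them afterwards.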

    In general though, it would be easier to use XPath to search for blacklisted nodes instead of traversing the DOM tree. Also keep in mind that DOM is already quite good at preventing XSS, because it automatically escapes XML entities in node values.
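
A minimal sketch of the XPath route, assuming an example blacklist (`script`, `iframe`, `object`, `embed`) and made-up markup. The second query, which strips inline event handler attributes, is an extra illustration and not something the text above prescribes.

```php
<?php
// Sample markup, chosen for illustration only.
$doc = new DOMDocument();
$doc->loadHTML('<div><p onclick="x()">Hi</p><script>alert(1)</script><iframe></iframe></div>');
$xpath = new DOMXPath($doc);

// Example blacklist: one query finds every unwanted element in the document.
$blacklisted = $xpath->query('//script | //iframe | //object | //embed');

// Copying the result into an array first is a cheap safeguard against the
// list changing while nodes are being removed.
foreach (iterator_to_array($blacklisted) as $node) {
    $node->parentNode->removeChild($node);
}

// Attributes can be targeted the same way, e.g. inline event handlers (on*).
$handlers = $xpath->query('//@*[starts-with(name(), "on")]');
foreach (iterator_to_array($handlers) as $attr) {
    $attr->ownerElement->removeAttributeNode($attr);
}

$html = $doc->saveHTML($doc->getElementsByTagName('body')->item(0));
echo $html, "\n";
```

A blacklist like this only catches what it names, so it is a weaker guarantee than the whitelist approach in the question.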

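    The other thing you have to be aware of is that any manipulation of a DOMDocument immediately affects live DOMNodeLists, such as childNodes or the result of getElementsByTagName(), and that can lead to skipped nodes while you are manipulating them. Lists returned by XPath queries are, as far as I know, snapshots, but they can still contain nodes whose parent you have already removed. See DOMNode replacement with PHP's DOM classes for an example.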
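
To make the skipping concrete, here is a small sketch using a live list from getElementsByTagName(); the markup and variable names are made up for illustration. The forward loop removes only every other node, while the backward loop removes them all.

```php
<?php
// Sample markup, chosen for illustration only.
$html = '<div><span>a</span><span>b</span><span>c</span><span>d</span></div>';

// Naive forward loop over a live list: each removal shifts the remaining
// items down by one, so every second node is skipped.
$doc = new DOMDocument();
$doc->loadHTML($html);
$spans = $doc->getElementsByTagName('span');
for ($i = 0; $i < $spans->length; $i++) {
    $span = $spans->item($i);
    $span->parentNode->removeChild($span);
}
$leftNaive = $doc->getElementsByTagName('span')->length;

// Backward loop: removing the last item never changes the indexes of the
// items still to be visited.
$doc = new DOMDocument();
$doc->loadHTML($html);
$spans = $doc->getElementsByTagName('span');
for ($i = $spans->length - 1; $i >= 0; $i--) {
    $span = $spans->item($i);
    $span->parentNode->removeChild($span);
}
$leftBackward = $doc->getElementsByTagName('span')->length;

// prints "naive: 2 left, backward: 0 left"
printf("naive: %d left, backward: %d left\n", $leftNaive, $leftBackward);
```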
