Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6238629
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 24, 20262026-05-24T11:14:57+00:00 2026-05-24T11:14:57+00:00

I’m trying to code a secure and lightweight white-list based HTML purifier which will

  • 0

I’m trying to code a secure and lightweight white-list based HTML purifier which will use DOMDocument. In order to avoid unnecessary complexity I am willing to make the following compromises:

  • HTML comments are removed
  • script and style tags are stripped all together
  • only the child nodes of the body tag will be returned
  • all HTML attributes that can trigger Javascript events will either be validated or removed

I’ve been reading a lot about on XSS attacks and prevention and I hope I’m not being too naive (if I am, please let me know!) in assuming that if I follow all the rules I mentioned above, I will be safe from XSS.

The problem is I am not sure what other tags and attributes (in any [X]HTML version and/or browser versions/implementations) can trigger Javascript events, besides the default Javascript event attributes:

  • onAbort
  • onBlur
  • onChange
  • onClick
  • onDblClick
  • onDragDrop
  • onError
  • onFocus
  • onKeyDown
  • onKeyPress
  • onKeyUp
  • onLoad
  • onMouseDown
  • onMouseMove
  • onMouseOut
  • onMouseOver
  • onMouseUp
  • onMove
  • onReset
  • onResize
  • onSelect
  • onSubmit
  • onUnload

Are there any other non-default or proprietary event attributes that can trigger Javascript (or VBScript, etc…) events or code execution? I can think of href, style and action, for instance:

<a href="javascript:alert(document.location);">XSS</a> // or
<b style="width: expression(alert(document.location));">XSS</b> // or
<form action="javascript:alert(document.location);"><input type="submit" /></form>

I will probably just remove any style attributes in the HTML tags, the action and href attributes pose a bigger challenge but I think the following code is enough to make sure their value is either a relative or absolute URL and not some nasty Javascript code:

$value = $attribute->value;

if ((strpos($value, ':') !== false) && (preg_match('~^(?:(?:s?f|ht)tps?|mailto):~i', $value) == 0))
{
    $node->removeAttributeNode($attribute);
}

So, my two obvious questions are:

  1. Am I missing any tags or attributes that can trigger events?
  2. Is there any attack vector that is not covered by these rules?

After a lot of testing, pondering and researching I’ve come up with the following (rather simple) implementation which, appears to be immune to any XSS attack vector I could throw at it.

I highly appreciate all your valuable answers, thanks.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-24T11:14:58+00:00Added an answer on May 24, 2026 at 11:14 am

    You mention href and action as places javascript: URLs can appear, but you’re missing the src attribute among a bunch of other URL loading attributes.

    Line 399 of the OWASP Java HTMLPolicyBuilder is the definition of URL attributes in a white-listing HTML sanitizer.

    private static final Set<String> URL_ATTRIBUTE_NAMES = ImmutableSet.of(
      "action", "archive", "background", "cite", "classid", "codebase", "data",
      "dsync", "formaction", "href", "icon", "longdesc", "manifest", "poster",
      "profile", "src", "usemap");
    

    The HTML5 Index contains a summary of attribute types. It doesn’t mention some conditional things like <input type=URL value=...> but if you scan that list for valid URL and friends, you should get a decent idea of what HTML5 adds. The set of HTML 4 attributes with type %URI is also informative.

    Your protocol whitelist looks very similar to the OWASP sanitizer one. The addition of ftp and sftp looks innocuous enough.

    A good source of security related schema info for HTML element and attributes is the Caja JSON whitelists which are used by the Caja JS HTML sanitizer.

    How are you planning on rendering the resulting DOM? If you’re not careful, then even if you strip out all the <script> elements, an attacker might get a buggy renderer to produce content that a browser interprets as containing a <script> element. Consider the valid HTML that does not contain a script element.

    <textarea><&#47;textarea><script>alert(1337)</script></textarea>
    

    A buggy renderer might output the contents of this as:

    <textarea></textarea><script>alert(1337)</script></textarea>
    

    which does contain a script element.

    (Full disclosure: I wrote chunks of both HTML sanitizers mentioned above.)

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I am trying to understand how to use SyndicationItem to display feed which is
link Im having trouble converting the html entites into html characters, (&# 8217;) i
I'm trying to decode HTML entries from here NYTimes.com and I cannot figure out
I'm trying to use string.replace('’','') to replace the dreaded weird single-quote character: ’ (aka
Basically, what I'm trying to create is a page of div tags, each has
I'm new to using the Perl treebuilder module for HTML parsing and can't figure
I used javascript for loading a picture on my website depending on which small
I want use html5's new tag to play a wav file (currently only supported
I am trying to loop through a bunch of documents I have to put
I'm parsing an RSS feed that has an &#8217; in it. SimpleXML turns this

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.