What’s an example of something dangerous that would not be caught by the code below?
EDIT: After some of the comments I added another line, commented below. See Vinko’s comment in David Grant’s answer. So far only Vinko has answered the question, which asks for specific examples that would slip through this function. Vinko provided one, but I’ve edited the code to close that hole. If another of you can think of another specific example, you’ll have my vote!
    using System.Text.RegularExpressions;

    // Blacklist filter from the question: defuses known-dangerous strings by
    // inserting "SAFE" into them, matching case-insensitively.
    public static string strip_dangerous_tags(string text_with_tags)
    {
        string s = Regex.Replace(text_with_tags, @"<script", "<scrSAFEipt", RegexOptions.IgnoreCase);
        s = Regex.Replace(s, @"</script", "</scrSAFEipt", RegexOptions.IgnoreCase);
        s = Regex.Replace(s, @"<object", "<objSAFEct", RegexOptions.IgnoreCase);
        s = Regex.Replace(s, @"</object", "</obSAFEct", RegexOptions.IgnoreCase);
        // ADDED AFTER THIS QUESTION WAS POSTED
        s = Regex.Replace(s, @"javascript", "javaSAFEscript", RegexOptions.IgnoreCase);
        s = Regex.Replace(s, @"onabort", "onSAFEabort", RegexOptions.IgnoreCase);
        s = Regex.Replace(s, @"onblur", "onSAFEblur", RegexOptions.IgnoreCase);
        s = Regex.Replace(s, @"onchange", "onSAFEchange", RegexOptions.IgnoreCase);
        s = Regex.Replace(s, @"onclick", "onSAFEclick", RegexOptions.IgnoreCase);
        s = Regex.Replace(s, @"ondblclick", "onSAFEdblclick", RegexOptions.IgnoreCase);
        s = Regex.Replace(s, @"onerror", "onSAFEerror", RegexOptions.IgnoreCase);
        s = Regex.Replace(s, @"onfocus", "onSAFEfocus", RegexOptions.IgnoreCase);
        s = Regex.Replace(s, @"onkeydown", "onSAFEkeydown", RegexOptions.IgnoreCase);
        s = Regex.Replace(s, @"onkeypress", "onSAFEkeypress", RegexOptions.IgnoreCase);
        s = Regex.Replace(s, @"onkeyup", "onSAFEkeyup", RegexOptions.IgnoreCase);
        s = Regex.Replace(s, @"onload", "onSAFEload", RegexOptions.IgnoreCase);
        s = Regex.Replace(s, @"onmousedown", "onSAFEmousedown", RegexOptions.IgnoreCase);
        s = Regex.Replace(s, @"onmousemove", "onSAFEmousemove", RegexOptions.IgnoreCase);
        s = Regex.Replace(s, @"onmouseout", "onSAFEmouseout", RegexOptions.IgnoreCase);
        s = Regex.Replace(s, @"onmouseup", "onSAFEmouseup", RegexOptions.IgnoreCase);
        s = Regex.Replace(s, @"onreset", "onSAFEreset", RegexOptions.IgnoreCase);
        s = Regex.Replace(s, @"onresize", "onSAFEresize", RegexOptions.IgnoreCase);
        s = Regex.Replace(s, @"onselect", "onSAFEselect", RegexOptions.IgnoreCase);
        s = Regex.Replace(s, @"onsubmit", "onSAFEsubmit", RegexOptions.IgnoreCase);
        s = Regex.Replace(s, @"onunload", "onSAFEunload", RegexOptions.IgnoreCase);
        return s;
    }
It’s never enough – whitelist, don’t blacklist
For example, a javascript: pseudo-URL can be obfuscated with HTML entities, you've forgotten about <embed>, and there are dangerous CSS properties like behavior and expression in IE.

There are countless ways to evade filters, and such an approach is bound to fail. Even if you find and block every exploit possible today, new unsafe elements and attributes may be added in the future.
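To make the entity point concrete, here is a minimal sketch of one input that slips past the "javascript" rule. The class and method names (BlacklistBypassDemo, StripJavascript) are made up for illustration, and the method is a reduced copy of the question's filter that keeps only that one rule.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class BlacklistBypassDemo
{
    // Reduced copy of the question's filter: only the "javascript" rule.
    public static string StripJavascript(string html)
    {
        return Regex.Replace(html, "javascript", "javaSAFEscript", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        // "&#115;" is the HTML character reference for the letter "s".
        string payload = "<a href=\"java&#115;cript:alert(1)\">click</a>";
        string filtered = StripJavascript(payload);

        // The literal text "javascript" never appears, so the filter changes nothing.
        Console.WriteLine(filtered == payload);

        // A browser decodes character references in attribute values before
        // treating them as URLs; WebUtility.HtmlDecode mimics that step here.
        Console.WriteLine(WebUtility.HtmlDecode("java&#115;cript:alert(1)"));
    }
}
```

The filter compares raw text, while the browser acts on the decoded value, so the two never see the same string.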
There are only two good ways to secure HTML:
1. Convert it to text by replacing every < with &lt;. If you want to allow users to enter formatted text, you can use your own markup (e.g. Markdown, like SO does).
2. Parse the HTML into a DOM, check every element and attribute, and remove everything that is not whitelisted. You will also need to check the contents of allowed attributes like href (make sure that URLs use a safe protocol; block all unknown protocols). Once you've cleaned up the DOM, generate new, valid HTML from it. Never work on HTML as if it were text, because invalid markup, comments, entities, etc. can easily fool your filter.
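Two pieces of the above can be sketched with the standard library alone: escaping everything (the first option), and the protocol check on an href value (one step of the second option). The names SanitizeSketch, EscapeAsText, IsSafeUrl, and the list of allowed schemes are assumptions for illustration. The DOM parsing and element/attribute whitelisting are not shown, since they need an HTML parser that .NET's base library does not include.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

public static class SanitizeSketch
{
    // Assumed whitelist; adjust to whatever your application really needs.
    private static readonly HashSet<string> AllowedSchemes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "http", "https", "mailto" };

    // Option 1: treat the input as plain text. WebUtility.HtmlEncode escapes
    // <, >, & and quotes, so no markup survives.
    public static string EscapeAsText(string userInput)
    {
        return WebUtility.HtmlEncode(userInput);
    }

    // Part of option 2: accept an href value only if it is an absolute URL
    // with a whitelisted scheme. Anything unparseable is rejected.
    public static bool IsSafeUrl(string attributeValue)
    {
        if (attributeValue == null) return false;

        // Decode character references first, as a real HTML parser would
        // already have done for a DOM attribute value.
        string url = WebUtility.HtmlDecode(attributeValue).Trim();

        Uri parsed;
        return Uri.TryCreate(url, UriKind.Absolute, out parsed)
            && AllowedSchemes.Contains(parsed.Scheme);
    }

    public static void Main()
    {
        Console.WriteLine(EscapeAsText("<script>alert(1)</script>"));
        Console.WriteLine(IsSafeUrl("https://example.com/"));
        Console.WriteLine(IsSafeUrl("java&#115;cript:alert(1)"));
    }
}
```

This sketch rejects relative URLs outright, which is stricter than most sites want; a fuller version would resolve them against a known base URL before checking the scheme. The important property is that it fails closed: unknown or unparseable input is refused instead of being passed through.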
Also make sure your page declares its encoding, because there are exploits that take advantage of browsers auto-detecting the wrong encoding.
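Declaring the encoding usually means sending it in the Content-Type response header and repeating it in a meta tag near the top of the document. The sketch below only builds those two strings; how you attach the header depends on your web framework, and the class name EncodingDeclarationDemo and the choice of UTF-8 are assumptions.

```csharp
using System;
using System.Net.Http.Headers;

public static class EncodingDeclarationDemo
{
    // Tag to place early in <head>, so the browser does not have to guess.
    public const string MetaTag = "<meta charset=\"utf-8\">";

    // Value for the Content-Type response header, with an explicit charset.
    public static string BuildContentType()
    {
        var contentType = new MediaTypeHeaderValue("text/html") { CharSet = "utf-8" };
        return contentType.ToString();
    }

    public static void Main()
    {
        Console.WriteLine("Content-Type: " + BuildContentType());
        Console.WriteLine(MetaTag);
    }
}
```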