I need to implement a simple and efficient XSS filter in C++ for CppCMS. I can’t use the existing high-quality filters written in PHP, because CppCMS is a high-performance framework written in C++.
The basic idea is to provide a filter that has a white list of HTML tags and a white list of options for those tags. For example, typical HTML input may consist of <b> and <i> tags and an <a> tag with href. But a straightforward implementation is not good enough, because even simple allowed links may carry XSS:
<a href='javascript:alert('XSS')'>Click On Me</a>
Many other examples can be found there. So I also thought about creating a white list of prefixes for attributes like href/src, so that I always check whether the value starts with (https?|ftp)://
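A minimal sketch of such a prefix check, assuming the attribute value has already been extracted and entity-decoded by the parser. The function name has_allowed_scheme is hypothetical and not part of CppCMS; the match is deliberately strict, so leading whitespace or any other scheme is rejected.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: returns true only if `url` starts with
// http://, https:// or ftp:// (compared case-insensitively).
// Anything else, including "javascript:", is rejected.
bool has_allowed_scheme(const std::string& url)
{
    static const char* const allowed[] = { "http://", "https://", "ftp://" };
    for (const char* raw : allowed) {
        const std::string prefix(raw);
        if (url.size() < prefix.size())
            continue;
        bool match = true;
        for (std::size_t i = 0; i < prefix.size(); ++i) {
            // Cast to unsigned char before tolower to avoid undefined
            // behaviour on bytes >= 0x80.
            unsigned char c = static_cast<unsigned char>(url[i]);
            if (std::tolower(c) != prefix[i]) {
                match = false;
                break;
            }
        }
        if (match)
            return true;
    }
    return false;
}
```

A prefix check like this only covers the URL scheme; it says nothing about other attributes such as style, which need their own rules.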
Questions:
- Are these assumptions good enough for most purposes? That is, if I do not allow options for style tags and check src/href against a white list of prefixes, does that solve the XSS problem? Are there problems that can’t be fixed this way?
- Is there a good reference for a formal grammar of HTML/XHTML, so that I can write a simple parser that cleans up all incorrect or forbidden tags like <script>?
You can take a look at the AntiSamy project, which tries to accomplish the same thing. It’s Java and .NET, though.
Edit 1, a bit extra:
You can potentially come up with a very strict white list. It should be well structured, tight, and not very flexible. When you combine flexibility with many tags, many attributes, and different browsers, you generally end up with an XSS vulnerability.
I don’t know what your requirements are, but I’d go with strict and simple tag support (only b, li, h1, etc.), then strict attribute support based on the tag (for example, src is only valid on an img tag), and then whitelisting of the attribute values, as you stated: http|https|ftp, or style=’color|background-color’, etc.
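The per-tag attribute rule could be represented roughly like this. It is only a sketch: the type tag_whitelist, the helper names, and the particular set of tags are illustrative assumptions, and tag and attribute names are assumed to be lowercased by the parser before lookup.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical whitelist: tag name -> attributes permitted on that tag.
// A tag that is absent from the map is rejected outright.
typedef std::map<std::string, std::set<std::string> > tag_whitelist;

// Builds an example whitelist; the chosen tags are illustrative only.
tag_whitelist make_strict_whitelist()
{
    tag_whitelist w;
    w["b"];                  // allowed, but with no attributes at all
    w["i"];
    w["li"];
    w["h1"];
    w["a"].insert("href");   // href is accepted only on <a>
    w["img"].insert("src");  // src is accepted only on <img>
    return w;
}

bool tag_allowed(const tag_whitelist& w, const std::string& tag)
{
    return w.find(tag) != w.end();
}

bool attribute_allowed(const tag_whitelist& w,
                       const std::string& tag,
                       const std::string& attr)
{
    tag_whitelist::const_iterator it = w.find(tag);
    return it != w.end() && it->second.count(attr) != 0;
}
```

Keeping the default as "reject" means a forgotten tag or attribute fails closed. Attribute values still need their own whitelist check on top of this lookup.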
Consider this one:
<x style='express/**/ion:(alert(/bah!/))'>
You also need to think about character whitelisting or UTF-8 normalization, because different encodings can cause awkward issues, such as new lines in attributes or invalid UTF-8 sequences.
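As a rough illustration of the character-level check, the sketch below rejects attribute values that are not well-formed UTF-8 or that contain control characters (including new lines). The function name is_clean_attribute_value and the choice to reject every control character are assumptions made for this example, not an established rule.

```cpp
#include <cstddef>
#include <string>

// Hypothetical check for an attribute value: true only if `s` is
// well-formed UTF-8 and contains no control characters
// (code points below U+0020, or U+007F).
bool is_clean_attribute_value(const std::string& s)
{
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const unsigned long min_cp[] = { 0x0, 0x80, 0x800, 0x10000 };
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        unsigned long cp;
        std::size_t extra;  // number of continuation bytes expected
        if (c < 0x80)                { cp = c;        extra = 0; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
        else return false;  // stray continuation byte or invalid lead byte
        if (n - i - 1 < extra)
            return false;   // sequence truncated at end of input
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80)
                return false;  // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min_cp[extra]) return false;               // overlong encoding
        if (cp > 0x10FFFF) return false;                    // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;     // surrogate
        if (cp < 0x20 || cp == 0x7F) return false;          // control character
        i += extra + 1;
    }
    return true;
}
```

Rejecting overlong encodings matters because a sequence such as 0xC0 0xAF would otherwise decode to "/" and could slip past a filter that only inspects raw bytes.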