I want to allow users to create tiny templates that I then render in Django with a predefined context. I am assuming the Django rendering is safe (I asked a question about this before), but there is still the risk of cross-site scripting, and I’d like to prevent this. One of the main requirements of these templates is that the user should have some control over the layout of the page, not just its semantics. I see a couple of solutions:
- Allow the user to use HTML, but filter out dangerous tags manually in the final step (things like <script> and <a onclick='..'>). I’m not so enthusiastic about this option, because I’m afraid I might overlook some tags. Even then, the user could still use absolute positioning on <div>s to mess up a thing or two on the rest of the page.
- Use a markup language that produces safe HTML. From what I can see, in most markup languages, I could strip any HTML and then process the result. The problem with this is that most markup languages are not very powerful layout-wise. As far as I could see, there is no way to center elements in Markdown, not even in ReST. The pro here is that some markup languages are well-documented, and users might already know how to use them.
- Come up with some proprietary markup. The cons I see here are pretty much all implied by the word proprietary.
So, to summarize: Is there some safe and easy way to “purify” HTML (preventing XSS), or is there a reasonably ubiquitous markup language that gives some control over layout and styling?
There’s the PHP-based HTML Purifier. I have not used it myself yet, but I have heard very good things about it. They promise a lot.
Maybe it’s worth a try even though it’s not Python-based. Update: @Matchu has found a Python-based alternative that looks good too.
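To illustrate what such a purifier does in principle, here is a minimal allowlist sketch using only Python’s standard html.parser. It is not the library mentioned above, and the tag list, attribute list and the sanitize() helper are assumptions made up for this example. A maintained, well-tested library is still the safer choice for production use.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist; adjust to whatever the templates should permit.
ALLOWED_TAGS = {"p", "div", "span", "b", "i", "em", "strong",
                "ul", "ol", "li", "br", "a"}
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_URL_PREFIXES = ("http://", "https://", "/")
VOID_TAGS = {"br"}
# The content of these elements is dropped, not only the tags themselves.
DROP_CONTENT = {"script", "style"}


class AllowlistSanitizer(HTMLParser):
    """Re-emits only allowlisted tags and attributes; escapes all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, set()):
                continue  # drops onclick, style, and anything else unlisted
            if name == "href" and not value.strip().lower().startswith(SAFE_URL_PREFIXES):
                continue  # drops javascript: and other unexpected schemes
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data))


def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

For example, sanitize('<p onclick="x()">hi</p><script>alert(1)</script>') keeps the paragraph but drops both the event handler and the script. The point of an allowlist is that tags you overlooked are removed by default, which addresses the worry about forgetting a dangerous tag.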
You’ll have a lot of very difficult edge cases, though; just think about Flash embeds. Plus, malicious uses of position: absolute are extremely difficult to track down (there’s position: relative, which could achieve the same effect but also be a completely legitimate layout tool). Maybe take a look at what, for example, eBay allows and doesn’t allow? If anybody has the necessary experience to know what’s dangerous and what isn’t from millions of examples, they do.
Related resources on eBay:
- HTML & JavaScript with examples
- Site Interference (it’s unclear, though, what is just forbidden and what gets filtered)
From what I found, they don’t seem to publish their internal HTML blacklists, but output an error message if forbidden code is found. (Probably a wise move on their part, but unfortunate for the purposes of this question.)
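Since there is no published blacklist to copy, one way to approach the position: absolute concern above is to turn it around and allow only a short list of CSS properties and values in style attributes. The sketch below assumes a hypothetical ALLOWED_CSS table and filter_style() helper. It is deliberately strict and only a starting point, not a complete CSS sanitizer.

```python
# Hypothetical allowlist of CSS properties and the exact values permitted.
ALLOWED_CSS = {
    "text-align": {"left", "center", "right", "justify"},
    "font-weight": {"normal", "bold"},
    "font-style": {"normal", "italic"},
}


def filter_style(style):
    """Keep only allowlisted 'property: value' declarations from a style string."""
    kept = []
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        name = name.strip().lower()
        value = value.strip().lower()
        # Anything not explicitly listed (position, top, url(...), ...) is dropped.
        if sep and value in ALLOWED_CSS.get(name, ()):
            kept.append(f"{name}: {value}")
    return "; ".join(kept)
```

With this table, filter_style("text-align: center; position: absolute; top: 0") keeps only the centering declaration, so users can still center elements without being able to reposition them over the rest of the page.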