Overflowed Stack,
I have a Java web application (running on Tomcat) in which I allow the user to upload HTML code through a form.
Since I am running on Tomcat and I actually display the user-uploaded HTML, I do not want a user to be able to upload malicious JSP tags, scriptlets or EL expressions and have them executed on the server. I want to filter out any JSP/non-HTML content.
Writing a parser myself seems too onerous, given the many subtleties one has to take care of (comments, byte representation for the scripts, etc.).
Do you know of any API/library which does this for me? I know about Caja filtering, but I am looking for something specifically for JSPs.
Many Thanks,
JP, Malta.
Using a library for content cleaning is better than trying to do it yourself with, e.g., regular expressions.
Try AntiSamy from the Open Web Application Security Project (OWASP).
http://www.owasp.org/index.php/Antisamy
I haven't used it (yet), but it seems suitable. JSP content should be automatically removed or escaped by the HTML normalization.
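If you want an extra check in front of (not instead of) a sanitizing library, something like the following could flag or neutralize the JSP delimiters before the content is stored. This is only a minimal sketch using the standard library: `JspMarkerFilter`, `containsJspMarker` and `neutralize` are hypothetical names I made up, not part of AntiSamy or Tomcat, and the marker list is an assumption that covers only `<%`, `<jsp:` actions and `${`/`#{` EL openers. It does not handle custom taglib prefixes (e.g. `<c:out>`) or encoding tricks, which is exactly why a real library is still the better option.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: a coarse pre-check for JSP syntax in uploaded HTML. */
final class JspMarkerFilter {

    // "<%" opens scriptlets, expressions, declarations, directives and JSP comments.
    // "<jsp:" / "</jsp:" open and close standard actions such as <jsp:include>.
    // "${" and "#{" open EL expressions.
    private static final Pattern JSP_MARKER =
            Pattern.compile("<%|</?jsp:|\\$\\{|#\\{", Pattern.CASE_INSENSITIVE);

    private JspMarkerFilter() {}

    /** Returns true if the input contains any of the markers listed above. */
    static boolean containsJspMarker(String html) {
        return html != null && JSP_MARKER.matcher(html).find();
    }

    /** Rewrites the markers as HTML entities so a browser shows them as text. */
    static String neutralize(String html) {
        if (html == null) {
            return "";
        }
        return html
                .replace("<%", "&lt;%")       // scriptlets, directives, JSP comments
                .replace("${", "&#36;{")      // immediate EL
                .replace("#{", "&#35;{")      // deferred EL
                .replaceAll("(?i)<(/?jsp:)", "&lt;$1"); // standard actions
    }
}
```

You could reject the upload outright when `JspMarkerFilter.containsJspMarker(upload)` is true, or store `JspMarkerFilter.neutralize(upload)` and then pass the result through the sanitizer as usual.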
Edit: I just found these:
Best Practice: User generated HTML cleaning
RegEx match open tags except XHTML self-contained tags