I’m creating a chat widget for my web site. The users will be able to input straight text – no html.
In an effort to eliminate HTML tags AND to allow users to use “<” and “>”, I am taking their input and sanitizing it in PHP, using strip_tags() on the input and htmlentities() on the output to the users’ screens. One problem is that if a user inputs “Russia<China”, strip_tags() will greedily eliminate everything after the “<”.
My question is: if I use a regex to insert a space between a “<” and the next non-space character, will that help me eliminate the threat of XSS? Will it prevent a potential HTML tag from rendering on the user’s screen?
Say, if something like this slips through:
< script type=’text/javascript’>alert(‘some malicious code’);< /script>
One advantage in creating that space (e.g. < script… >) seems to be that strip_tags() will leave the “<” alone.
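To make the idea concrete, here is a rough sketch of the spacing step. The helper name space_after_lt() and the exact pattern are only illustrative assumptions, not tested production code:

```php
<?php
// Hypothetical helper (the name is made up for this example):
// insert a space after every "<" that is directly followed by a
// non-whitespace character, so that "<China" becomes "< China".
function space_after_lt(string $text): string
{
    // (?=\S) is a lookahead: it checks the next character without consuming it.
    // preg_replace() returns null on error, so fall back to the raw text.
    return preg_replace('/<(?=\S)/', '< ', $text) ?? $text;
}

$input  = "Russia<China <script type='text/javascript'>alert('x');</script>";
$spaced = space_after_lt($input);

echo $spaced, PHP_EOL;                 // text with a space after each "<"
echo strip_tags($spaced), PHP_EOL;     // what strip_tags() leaves behind
echo htmlentities(strip_tags($spaced), ENT_QUOTES, 'UTF-8'), PHP_EOL; // what gets sent to the page
```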
Thanks for any suggestions.
The added space is enough to stop tags from being stripped by strip_tags, and from being rendered as HTML by browsers.
But at what point exactly would you use such a regular expression? If you add it after you’ve done strip_tags(), legitimate text will already have been stripped. If you add it before strip_tags(), there won’t be any tags left to strip, so users will see the spaced HTML tags in text.
But if they’re going to see (mangled) tags anyway, why are you doing this at all? You can just do htmlspecialchars(), which you have to do anyway.
Even a HTML parser isn’t going to help you, because a HTML parser would consider the <China in your example a tag too.
And is the person typing a<b making a comparison, talking about HTML, trying to add emphasis, or trying to inject a malicious script?
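As a minimal sketch of the "just escape on output" approach: store the text exactly as typed and pass it through htmlspecialchars() when printing it. The helper name escape_for_html() is hypothetical, and the ENT_QUOTES flag and UTF-8 charset are assumptions about your setup:

```php
<?php
// Hypothetical helper (name chosen for this example only).
// Escapes &, <, >, " and ' so the browser shows the text literally
// instead of interpreting it as markup.
function escape_for_html(string $text): string
{
    // ENT_QUOTES also escapes single quotes; 'UTF-8' is an assumed page encoding.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

$messages = [
    'Russia<China',
    "<script type='text/javascript'>alert('some malicious code');</script>",
];

foreach ($messages as $message) {
    // Nothing is stripped: the user sees exactly what was typed.
    echo escape_for_html($message), PHP_EOL;
}
```

With this, “Russia<China” is sent to the browser as Russia&lt;China and displayed unchanged, and the script tag shows up as harmless text. No guessing about what the user meant is needed.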