In my controller method for handling a (potentially hostile) user input field I have the following code:
string tmptext = comment.Replace(System.Environment.NewLine, "{break was here}"); //marks line breaks for later re-insertion
tmptext = Encoder.HtmlEncode(tmptext);
//other sanitizing goes in here
tmptext = tmptext.Replace("{break was here}", "<br />");
var regex = new Regex("(<br /><br />)\\1+");
tmptext = regex.Replace(tmptext, "$1");
My goal is to preserve line breaks for typical non-malicious use and display user input in safe, htmlencoded strings. I take the user input, parse it for newline characters and place a delimiter at the line breaks. I perform the HTML encoding and reinsert the breaks. (i will likely change this to reinserting paragraphs as p tags instead of br, but for now i’m using br)
Now actually inserting real html breaks opens me up to a subtle vulnerability: the enter key. The regex.replace code is there to strip out a malicious user just standing on the enter key and filling the page with crap.
This is a fix for big crap floods of just white but still leaves me open to abuse like entering one character, two line breaks, one character, two line breaks all down the page.
My question is for a method of determining that this is abusive and failing it on validation. I’m scared that there might not be a simple procedural method to do it and instead will need heuristic techniques or bayesian filters. Hopefully, someone has an easier, better way.
EDIT: perhaps I wasn’t clear in the problem description, the regex handles seeing multiple line breaks in a row and converting them to just one or two. That problem is solved. The real problem is distinguishing legitimate text from crap flood like this:
a
a
a
…imagine 1000 of these…
a
a
a
a
A random suggestion, inspired by slashdot.org’s comment filters: compress your user input with a System.IO.Compression.DeflateStream, and if it is too small in comparison with the original (you’ll have to do some experimentation to find a useful cut-off) reject it.