I have some code that removes HTML tags from text. I don’t care about the content (script, CSS, text, etc.); the important thing, at least for now, is that the tags themselves are stripped out.
This may be straying into micro-optimisation, but this code is one of a small number of functions that will run very often against large amounts of data, so any percentage saving here may carry through to a useful saving for the application as a whole.
The code at present looks like this:
    public static string StripTags(string html)
    {
        var currentIndex = 0;
        var insideTag = false;
        var output = new char[html.Length];
        for (int i = 0; i < html.Length; i++)
        {
            var c = html[i];
            if (c == '>')
            {
                insideTag = false;
                continue;
            }
            if (!insideTag)
            {
                if (c == '<')
                {
                    insideTag = true;
                    continue;
                }
                output[currentIndex] = c;
                currentIndex++;
            }
        }
        return new string(output, 0, currentIndex);
    }
Are there any obvious .NET tricks I’m missing out on here? For info, this is using .NET 4.
Many thanks.
In this code you copy chars one by one. You might be able to speed it up considerably by only finding where the current section (inside or outside a tag) ends and then copying that whole chunk in one go with Array.Copy (or string.CopyTo, since the source here is a string). That would enable lower-level optimizations; for instance, a 64-bit machine could move 4 UTF-16 chars (4 × 2 × 8 bits) in a single operation. The stretches of text between the tags are probably quite long, so this could add up.
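As a rough sketch of the chunk-copying idea, assuming the same tag rules as the original StripTags: the version below uses string.IndexOfAny to find where each run of text ends and string.CopyTo to move that run in one call. The class and method names (TagStripper, StripTagsChunked) are made up for illustration, and whether it is actually faster would need measuring on your own data.

```csharp
using System;

public static class TagStripper
{
    private static readonly char[] Delimiters = { '<', '>' };

    // Hypothetical chunk-copying variant of StripTags (name invented here).
    public static string StripTagsChunked(string html)
    {
        // The output can never be longer than the input.
        var output = new char[html.Length];
        var written = 0;
        var pos = 0;

        while (pos < html.Length)
        {
            // Find the end of the current run of plain text.
            var end = html.IndexOfAny(Delimiters, pos);
            if (end < 0)
            {
                end = html.Length;
            }

            // Copy the whole run in one call instead of char by char.
            var count = end - pos;
            html.CopyTo(pos, output, written, count);
            written += count;

            if (end == html.Length)
            {
                break;
            }

            if (html[end] == '>')
            {
                // A stray '>' outside a tag is dropped, as in the original loop.
                pos = end + 1;
                continue;
            }

            // We are at '<': skip everything up to and including the next '>'.
            var close = html.IndexOf('>', end + 1);
            if (close < 0)
            {
                // Unterminated tag: the rest is discarded, as in the original loop.
                break;
            }
            pos = close + 1;
        }

        return new string(output, 0, written);
    }
}
```

Calling TagStripper.StripTagsChunked("<p>Hello <b>world</b></p>") should give "Hello world". The gain, if any, depends on how long the text runs between tags are; with many tiny runs the per-call overhead may cancel it out.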
Also, the StringBuilder documentation mentions somewhere that, because it’s implemented in the framework rather than in C#, it has performance that you can’t replicate in managed C#. I’m not sure how you would append a chunk with it, but you might look into that.
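On appending a chunk: StringBuilder has an Append(string, int, int) overload that appends a slice of a string, so the same scan can be sketched with a builder instead of a char array. This is only an illustration under the same tag rules as the original StripTags; the names BuilderTagStripper and StripTagsWithBuilder are invented here, and it is not a claim that this is faster than the array version.

```csharp
using System.Text;

public static class BuilderTagStripper
{
    private static readonly char[] Delimiters = { '<', '>' };

    // Hypothetical StringBuilder variant (name invented for illustration).
    public static string StripTagsWithBuilder(string html)
    {
        // Pre-size the builder: the output can never be longer than the input.
        var builder = new StringBuilder(html.Length);
        var pos = 0;

        while (pos < html.Length)
        {
            // Find where the current run of plain text ends.
            var end = html.IndexOfAny(Delimiters, pos);
            if (end < 0)
            {
                end = html.Length;
            }

            // Append the slice html[pos .. end) in one call.
            builder.Append(html, pos, end - pos);

            if (end == html.Length)
            {
                break;
            }

            // After '<', resume past the next '>'; a stray '>' is simply skipped.
            var close = html[end] == '<' ? html.IndexOf('>', end + 1) : end;
            if (close < 0)
            {
                // Unterminated tag: drop the rest, as the original loop does.
                break;
            }
            pos = close + 1;
        }

        return builder.ToString();
    }
}
```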
Regards Gert-Jan