I need an efficient and (reasonably) reliable way to strip HTML tags from documents. It needs to be able to handle some fairly adverse circumstances:
- It’s not known ahead of time whether a document contains HTML at all.
- More likely than not, any HTML will be very poorly formatted.
- Individual documents might be very large, perhaps hundreds of megabytes.
- Non-HTML content might still be littered with angle brackets for whatever odd reason, so naive regular expressions along the lines of `<.+/?>` are a no-go. (And stripping XML is less desirable, anyway.)
I’m currently using HTML Agility Pack, and it’s just not cutting the mustard. Performance is poorer than I’d like, it doesn’t always handle truly awful formatting as gracefully as it could, and lately I’ve been running into problems with stack overflows on some of the more upsettingly large files.
I suspect that all of these problems stem from the fact that it’s trying to actually parse the data, which makes it a poor fit for my needs. I don’t want a syntax tree; I just want (most of) the tags to go away.
Using regular expressions seems like the obvious candidate. But then I remember this famous answer, and it makes me worry that’s not such a great idea. Then again, that diatribe’s points are focused on parsing, not necessarily on dumb tag-stripping. So are regexes OK for this purpose?
Assuming it isn’t a terrible idea, suggestions for regex that would do a good job are very welcome.
This regex finds all tags while skipping over angle brackets that appear inside quoted strings within a tag.
It can’t detect escaped quotes inside quoted strings (but I think that is unnecessary in HTML).
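A minimal Python sketch of this idea, written under my own assumptions rather than as the definitive pattern: the names `TAG_RE` and `strip_tags` are hypothetical, and the rule that a tag must start with a letter right after `<` is an extra assumption added so that stray brackets such as `1 < 2` are left alone.

```python
import re

# Hypothetical pattern illustrating the approach, not an exact solution:
#   </?            "<" plus an optional "/" for closing tags
#   [A-Za-z]       assumed rule: a tag name starts with a letter
#   (?:"[^"]*"     a double-quoted value, which may contain < or >
#     |'[^']*'     a single-quoted value, which may contain < or >
#     |[^'"<>])*   any other character that is not a quote or bracket
#   >              the bracket that closes the tag
TAG_RE = re.compile(r"""</?[A-Za-z](?:"[^"]*"|'[^']*'|[^'"<>])*>""")


def strip_tags(text: str) -> str:
    """Remove everything that looks like a tag; leave other text alone."""
    return TAG_RE.sub("", text)
```

With this sketch, `strip_tags('<a href="x>y.html">link</a> and 1 < 2 > 0')` keeps `link and 1 < 2 > 0`, because the `>` inside the quoted attribute is consumed as part of the tag and the loose brackets are not followed by a letter.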
Taking the list of all allowed tags and substituting it into the first part of the regex, like `<(tag1|tag2|...)`, could lead to a more precise solution. I’m afraid an exact solution can’t be found given your assumption about stray angle brackets, though; think, for example, of something like `<a href="test.html"> b<a </a>`…
EDIT:
Updated the regex (it performs a lot better than the previous one). Moreover, if you need to strip out code, I suggest doing a little cleanup before the first pass, such as replacing `<script.+?</script>` with nothing.
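A rough Python sketch of how the script cleanup and the tag-list idea could be combined. Everything named here (`KNOWN_TAGS`, `SCRIPT_RE`, `KNOWN_TAG_RE`, `strip_html`) is hypothetical, and the tag list is a deliberately tiny, assumed subset that you would replace with the tags you actually want removed.

```python
import re

# Assumed, illustrative subset of tag names; extend as needed.
KNOWN_TAGS = ("a", "b", "br", "div", "i", "p", "span")

# Pass 1: drop whole <script>...</script> blocks. DOTALL lets "." cross
# newlines, and the lazy ".*?" stops at the first closing tag.
SCRIPT_RE = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)

# Pass 2: drop only tags whose name is in the list. "\b" keeps "b" from
# matching the start of "br"; quoted attribute values may contain brackets.
KNOWN_TAG_RE = re.compile(
    r"</?(?:" + "|".join(KNOWN_TAGS) + r""")\b(?:"[^"]*"|'[^']*'|[^'"<>])*>""",
    re.IGNORECASE,
)


def strip_html(text: str) -> str:
    """Remove script blocks first, then any listed tags."""
    text = SCRIPT_RE.sub("", text)
    return KNOWN_TAG_RE.sub("", text)
```

Because only listed names are matched, bracketed text such as `<foo>` or XML elements with other names are left untouched, at the cost of having to maintain the list.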