I’ve seen regexes that can remove tags, which is great, but I also have HTML character entities and the like to deal with.
This isn’t actually from an HTML file; it’s from a string. I’m pulling down data from SharePoint web services, which gives me whatever HTML users enter or the editor generates, like
<div>Hello! Please remember to clean the break room!!! &quot;bob&quot; <BR> </div>
So, I’m parsing through 100-900 rows with 8-20 columns each.
Take a look at the HTML Agility Pack. It’s an HTML parser that you can use to extract the InnerText from HTML nodes in a document.

As has been pointed out many times here on SO, you can’t trust HTML parsing to a regular expression. There are times when it might be considered appropriate (for extremely limited tasks), but in general, HTML is too complex and too prone to irregularity. Bad things can happen when you try to parse HTML with regular expressions.
Using a parser such as HAP gives you much more flexibility. A (rough) example of what it might look like to use it for this task:
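A minimal sketch along those lines, assuming the HtmlAgilityPack NuGet package is referenced. The sample string is adapted from the question (with the quotes written as &quot; entities, which is my assumption), and the class and variable names are just for illustration:

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, not part of the .NET base library

class StripHtmlExample
{
    static void Main()
    {
        // Fragment modeled on the SharePoint string from the question.
        string html = "<div>Hello! Please remember to clean the break room!!! &quot;bob&quot; <BR> </div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parses straight from a string; no file is needed

        // InnerText concatenates the text nodes and drops the tags,
        // but it leaves character entities such as &quot; encoded.
        string text = doc.DocumentNode.InnerText;

        // DeEntitize converts entities back into the characters they stand for.
        string clean = HtmlEntity.DeEntitize(text).Trim();

        Console.WriteLine(clean);
    }
}
```

Since the data arrives as rows and columns, the same three calls (LoadHtml, InnerText, DeEntitize) could be wrapped in a small helper and applied to each cell.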
You can also perform XPath queries on your document, in case you’re only interested in a specific node or set of nodes:
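Again only a sketch, assuming the HtmlAgilityPack NuGet package; the markup and the XPath expressions here are made up for illustration:

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package

class XPathExample
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div><p>First &amp; foremost</p><p>Second note</p><span>Footer</span></div>");

        // SelectNodes returns null (not an empty collection) when nothing matches,
        // so check before iterating.
        HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs != null)
        {
            foreach (HtmlNode p in paragraphs)
            {
                Console.WriteLine(HtmlEntity.DeEntitize(p.InnerText));
            }
        }

        // SelectSingleNode returns the first matching node, or null if there is none.
        HtmlNode span = doc.DocumentNode.SelectSingleNode("//span");
        if (span != null)
        {
            Console.WriteLine(span.InnerText);
        }
    }
}
```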
Hope this helps.