I am trying to write a regular expression in C# to remove all script tags and anything contained within them.
So far I have come up with the following, but it does not work:
\<([^:]*?:)?script\>[^(\</<([^:]*?:)?script\>)]*?\</script\>
I’ll break it up and explain my thinking in each section:
\<([^:]*?:)?script\>
Here I am trying to match any script element, even one prefixed with a namespace, such as <a:script></a:script>. I have also added this to the closing tag.
[^(\</<([^:]*?:)?script\>)]*?
Here I am trying to state that it should allow anything to be contained within the tags except for </a:script>, </script>, etc.
\</script\>
Here I am stating that it should have a closing tag.
Can anyone spot where I am going wrong?
You can’t reliably parse HTML with regular expressions.
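As for where the pattern itself goes wrong: one likely culprit is the middle part. Square brackets define a set of single characters, so parentheses and groups inside them are just more literal characters; they cannot express "anything except this sequence". The sketch below shows this with a deliberately simplified class, [^(abc)], instead of the one from the question, and then shows a negative lookahead, which is the construct that does express "anything except this sequence". The sample strings are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

class CharClassDemo
{
    static void Main()
    {
        // [^(abc)] is a negated SET of the five characters ( a b c ),
        // not "anything except the sequence abc".
        var notInSet = new Regex(@"^[^(abc)]$");

        Console.WriteLine(notInSet.IsMatch("x")); // True: x is not in the set
        Console.WriteLine(notInSet.IsMatch("a")); // False: a is in the set
        Console.WriteLine(notInSet.IsMatch("(")); // False: the parenthesis is in the set too

        // A negative lookahead checks for a whole sequence at each position.
        // Singleline lets '.' match newlines inside the script body.
        string cleaned = Regex.Replace(
            "a<script>x<y</script>b",
            @"<script>(?:(?!</script>).)*</script>",
            "",
            RegexOptions.Singleline);

        Console.WriteLine(cleaned); // ab
    }
}
```

Even with the lookahead, a pattern like this still trips over attributes, odd casing, comments and malformed markup, which is why a regex is the wrong tool here.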
Use the HTML Agility Pack instead.
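Something along these lines should do it. This is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the class name StripScripts, the method name RemoveScripts and the sample input are made up for illustration. It also assumes the parser exposes a prefixed element such as <a:script> under its full name, which is why the name is matched by suffix rather than with an XPath query.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack" (assumed to be installed)

public static class StripScripts
{
    // Hypothetical helper: returns the markup with all script elements removed.
    public static string RemoveScripts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Collect matches into a list first so that removing nodes
        // does not disturb the enumeration of the tree.
        var scripts = doc.DocumentNode
            .Descendants()
            .Where(n => n.Name.Equals("script", StringComparison.OrdinalIgnoreCase)
                     || n.Name.EndsWith(":script", StringComparison.OrdinalIgnoreCase))
            .ToList();

        foreach (var node in scripts)
        {
            node.Remove(); // removes the element together with its contents
        }

        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        Console.WriteLine(RemoveScripts("<p>Hi</p><script>alert(1)</script><p>Bye</p>"));
    }
}
```

Because the parser builds a tree, this does not care about attributes on the tag, line breaks inside the script, or stray angle brackets in the script body.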