I've tried to understand a few examples, including questions here, so I apologise if this seems like a duplicate, but I cannot find a regular expression I can understand.
I have some HTML to parse using an XML parser, but I want to strip out the <head>...</head> section first, as the rest is valid enough for normal XML parsing.
Everything from <head> to </head>, including the content in between, must be removed without affecting the surrounding HTML (the <html> and <body> tags, etc.).
For reference, this is the document, including the head section I want removed:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" >
<html>
<head>
<link rel="stylesheet" type="text/css" href="/style/stylesheet.css" />
<meta name="description" content="Information" />
<base target="_top">
</head>
<body>
<!-- Body Here -->
</body>
</html>
I also need to strip the DOCTYPE; if this can be done using a regex, that would be great. The head is always the same: I want to remove everything from <head> to </head> inclusive and, if possible, remove the DOCTYPE from the text as well.
This also needs to work in Silverlight, so it should use System.Text.RegularExpressions or something similar.
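To make the requirement concrete, here is a minimal sketch of what such a removal might look like with System.Text.RegularExpressions. The class and method names (`HtmlCleaner`, `StripHeadAndDocType`) are made up for illustration, and the patterns assume the DOCTYPE contains no `>` character before its end and that there is only one head section, as in the sample above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative, not from any library.
public static class HtmlCleaner
{
    // Matches <!DOCTYPE ...> up to the first '>'.
    // Assumption: the declaration has no internal subset containing '>'.
    private static readonly Regex DocType =
        new Regex(@"<!DOCTYPE[^>]*>", RegexOptions.IgnoreCase);

    // Matches <head ...> through </head> inclusive.
    // \b stops the pattern from matching <header>; Singleline lets '.' cross
    // line breaks; the lazy '.*?' stops at the first </head>.
    private static readonly Regex Head =
        new Regex(@"<head\b[^>]*>.*?</head\s*>",
                  RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string StripHeadAndDocType(string html)
    {
        // Remove the DOCTYPE first, then the head section.
        string withoutDocType = DocType.Replace(html, string.Empty);
        return Head.Replace(withoutDocType, string.Empty);
    }
}
```

Calling `HtmlCleaner.StripHeadAndDocType(html)` on the sample should leave only the `<html>` and `<body>` elements, which can then be handed to the XML parser. This is not a general HTML sanitiser; it only works because the head is always the same.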
Extracting the Body was easier – here is the RegEx I am using:
Now I can parse that normally with LINQ-to-XML!
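For anyone following along, this is a minimal sketch of the body-extraction approach, not necessarily the exact expression referred to above. The names `BodyExtractor` and `ParseBody` are hypothetical, and the sketch assumes the document has a single body element whose content is well-formed XML.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper; the names are illustrative only.
public static class BodyExtractor
{
    // Matches <body ...> through </body> inclusive, across line breaks.
    private static readonly Regex Body =
        new Regex(@"<body\b[^>]*>.*?</body\s*>",
                  RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static XElement ParseBody(string html)
    {
        Match match = Body.Match(html);
        if (!match.Success)
        {
            throw new ArgumentException("No <body> element found.", "html");
        }

        // XElement.Parse throws XmlException if the body is not well-formed XML.
        return XElement.Parse(match.Value);
    }
}
```

The returned `XElement` can be queried with the usual LINQ-to-XML methods such as `Descendants` and `Elements`. Extracting the body sidesteps both the DOCTYPE and the head in one step, which is why it is simpler than removing them individually.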