I am looking for a regex that will let me extract just the HTML content between the body tags of an XHTML document.
The XHTML files that I need to parse will be very simple; I do not have to worry about JavaScript content or <![CDATA[ tags, for example.
Below is the expected structure of the HTML file that I have to parse. Since I know exactly what the HTML files I will be working with contain, this snippet pretty much covers my entire use case. If I can get a regex to extract the body of this example, I’ll be happy.
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Strict//EN' 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd'>
<html xmlns='http://www.w3.org/1999/xhtml'>
  <head>
    <title> </title>
  </head>
  <body contenteditable='true'>
    <p> Example paragraph content </p>
    <p> </p>
    <p> <br /> </p>
    <h1>Header 1</h1>
  </body>
</html>
Conceptually, I’ve been trying to build a regex string that matches everything BUT the inner body content. With this, I would use the C# Regex.Split() method to obtain the body content. I thought this regex:
((.|\n)*<body (.)*>)|((</body>(*|\n)*)
…would do the trick, but it doesn’t seem to work at all with my test content in RegexBuddy.
Would this work?
Of course, you need to add the necessary \s in order to take into account < body ...> (element with spaces), as in:
On second thought, I am not sure why I needed a negative look-ahead… This should also work (for a well-formed XHTML document):
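A minimal sketch of that look-ahead-free idea, written with Python's re module purely for illustration (the question itself targets the C# Regex.Split() method). The pattern, the OUTSIDE_BODY name and the extract_body helper are assumptions made for this example, not the pattern originally posted; it assumes a well-formed document with a single body element.

```python
import re

# Sample document, abbreviated from the example in the question.
XHTML = """<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Strict//EN' 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd'>
<html xmlns='http://www.w3.org/1999/xhtml'>
<head> <title> </title> </head>
<body contenteditable='true'>
<p> Example paragraph content </p>
<h1>Header 1</h1>
</body>
</html>"""

# Assumed pattern: match everything up to and including the opening body tag,
# or everything from the closing body tag to the end of the input.
# \s* tolerates spaces inside the tags; DOTALL lets "." cross newlines.
OUTSIDE_BODY = re.compile(
    r"^.*?<\s*body\b[^>]*>|<\s*/\s*body\s*>.*$",
    re.DOTALL,
)


def extract_body(document: str) -> str:
    """Return the inner body content by splitting on everything outside it."""
    parts = OUTSIDE_BODY.split(document)
    # A document with exactly one body yields ['', inner, ''].
    if len(parts) != 3:
        raise ValueError("expected exactly one <body>...</body> pair")
    return parts[1].strip()


if __name__ == "__main__":
    print(extract_body(XHTML))
```

Because the pattern has no capturing groups, the split result contains only the text between the two matches. In .NET, the counterpart of re.DOTALL is RegexOptions.Singleline; whether the rest of the pattern behaves identically there has not been checked here.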