I am looking for a regex statement that will let me extract the HTML

Asked: May 10, 2026 (2026-05-10T22:25:24+00:00)

I am looking for a regex that will let me extract just the HTML content between the body tags of an XHTML document.

The XHTML files that I need to parse are very simple. I do not have to worry about JavaScript content or <![CDATA[ sections, for example.

Below is the expected structure of the HTML file that I have to parse. Since I know exactly what the HTML files I will be working with contain, this snippet pretty much covers my entire use case. If I can get a regex to extract the body of this example, I’ll be happy.

<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Strict//EN'
    'http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd'>
<html xmlns='http://www.w3.org/1999/xhtml'>
  <head>
    <title>
    </title>
  </head>
  <body contenteditable='true'>
    <p>
      Example paragraph content
    </p>
    <p>
      &nbsp;
    </p>
    <p>
      <br />
      &nbsp;
    </p>
    <h1>Header 1</h1>
  </body>
</html>

Conceptually, I’ve been trying to build a regex string that matches everything BUT the inner body content. With this, I would use the C# Regex.Split() method to obtain the body content. I thought this regex:

((.|\n)*<body (.)*>)|((</body>(*|\n)*)

…would do the trick, but it doesn’t seem to work at all with my test content in RegexBuddy.

1 Answer

score 0 · Answer 1 · 2026-05-10T22:25:25+00:00

Would this work?

((?:.(?!<body[^>]*>))+.<body[^>]*>)|(</body\>.+)

Of course, you need to add \s* where necessary to allow for whitespace inside the tags, such as < body ...>, as in:

((?:.(?!<\s*body[^>]*>))+.<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)

On second thought, I am not sure why I needed a negative look-ahead… This should also work (for a well-formed XHTML document):

(.*<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)
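For illustration, here is a minimal sketch of how that last pattern might be used with Regex.Split() in C#, as the question intends. The class and method names (BodyExtractor, ExtractBody) are hypothetical. Two options are assumed to be necessary: RegexOptions.Singleline, so that . also matches newlines in a multi-line document, and RegexOptions.ExplicitCapture, because Regex.Split() otherwise includes the text captured by the parenthesized groups in its result array.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class BodyExtractor
{
    // Pattern taken unchanged from the answer above.
    private const string Pattern = @"(.*<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)";

    public static string ExtractBody(string xhtml)
    {
        // Singleline: '.' also matches '\n', so '.*' can span the whole prolog.
        // ExplicitCapture: unnamed groups do not capture, so Split() does not
        // add the matched prolog/epilog text to the result array.
        var options = RegexOptions.Singleline | RegexOptions.ExplicitCapture;
        string[] parts = Regex.Split(xhtml, Pattern, options);

        // Assumed layout for a document shaped like the question's example:
        // an empty string before the first match, the body content,
        // and an empty string after the last match.
        if (parts.Length != 3)
        {
            throw new FormatException("Document does not have the expected <body> structure.");
        }
        return parts[1];
    }
}
```

Calling BodyExtractor.ExtractBody(text) on a document shaped like the example in the question should return everything between the opening and closing body tags, including the surrounding whitespace. The sketch throws a FormatException for any other layout rather than guessing.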
