I’m looking for an efficient means of extracting an html “fragment” from an html document. My first implementation of this used the Html Agility Pack. This appeared to be a reasonable way to attack this problem, until I started running the extraction on large html documents – performance was very poor for something so trivial (I’m guessing due to the amount of time it was taking to parse the entire document).
Can anyone suggest a more efficient means of achieving my goal?
To summarize:
- For my purposes, an html “fragment” is defined as all content inside of the <body> tags of an html document
- Ideally, I’d like to return the content unaltered if it didn’t contain an <html> or <body> tag (I’ll assume I was passed an html fragment to begin with)
- I have the entire html document available in memory (as a string), I won’t be streaming it on demand – so a potential solution won’t need to worry about that.
- Performance is critical, so a potential solution should account for this.
Sample Input:
<html>
<head>
<title>blah</title>
</head>
<body>
<p>My content</p>
</body>
</html>
Desired Output:
<p>My content</p>
A solution in C# or VB.NET would be welcome.
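Most html is not going to be XHTML compliant. I would do an HTTP GET request and search the resulting text for "<body>" and "</body>". Note that .Contains("<body>") and .Contains("</body>") only tell you whether the tags are present; .IndexOf("<body>") and .IndexOf("</body>") give you their positions. You can use these two locations as your start and stop indexes for a reader stream (or for a plain Substring call if the document is already in memory as a string). Outside the body tag you really don’t need to worry about XML compliance.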
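The index-based approach described above could be sketched roughly as follows. This is a minimal sketch, not a tested production implementation: the class name HtmlFragment and method name ExtractBody are made up for illustration, and it assumes the opening tag may carry attributes (so it searches for "<body" and then the next ">"), that tag names may be in any case, and that trimming surrounding whitespace is acceptable, as in the sample output. If no complete body element is found, the input is returned unchanged.

```csharp
using System;

// Hypothetical helper; the names are illustrative and not from any library.
public static class HtmlFragment
{
    // Returns the text between <body ...> and </body>, or the input unchanged
    // when no complete body element is found (assumed to already be a fragment).
    public static string ExtractBody(string html)
    {
        if (string.IsNullOrEmpty(html)) return html;

        // Locate the opening tag; OrdinalIgnoreCase also matches <BODY>.
        int open = html.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
        if (open < 0) return html;

        // Skip past any attributes to the end of the opening tag.
        int start = html.IndexOf('>', open);
        if (start < 0) return html;
        start++;

        // Take the last closing tag in the document as the end of the body.
        int end = html.LastIndexOf("</body", StringComparison.OrdinalIgnoreCase);
        if (end < start) return html;

        // Assumption: surrounding whitespace is not significant, so trim it.
        return html.Substring(start, end - start).Trim();
    }
}
```

Calling HtmlFragment.ExtractBody on the sample input should yield the single <p> element, and a string with no body tags comes back as-is. Because this is plain string scanning rather than parsing, it can be fooled by a ">" inside an attribute value of the body tag or by a different tag whose name starts with "body"; whether that matters depends on the documents you expect.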