I have HTML with nested elements (mostly just div and p elements).
I need to return the same HTML, truncated to a given number of characters. The count should skip the HTML tags themselves and include only the characters in the InnerText of each element.
The resulting HTML should preserve its structure, including any closing tags needed for it to stay valid.
Sample input:
<div>
    <p>some text</p>
    <p>some more text some more text some more text some more text some more text</p>
    <div>
        <p>some more text some more text some more text some more text some more text</p>
        <p>some more text some more text some more text some more text some more text</p>
    </div>
</div>
Given int length = 16, the output should look like this:
<div>
    <p>some text</p> // 9 characters in the InnerText here
    <p>some mo</p> // 7 characters in the InnerText here; 9 + 7 = 16
</div>
Notice that the number of characters (including spaces) is 16. The subsequent <div> is dropped because the character count has already reached length. Notice also that the output HTML is still valid.
I've tried the following, but it does not work. The output is not as expected: some HTML elements get repeated.
public static string SubstringHtml(this string html, int length)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    int totalLength = 0;
    StringBuilder output = new StringBuilder();
    foreach (var node in doc.DocumentNode.Descendants())
    {
        totalLength += node.InnerText.Length;
        if (totalLength >= length)
        {
            int difference = totalLength - length;
            string lastPiece = node.InnerText.ToString().Substring(0, difference);
            output.Append(lastPiece);
            break;
        }
        else
        {
            output.Append(node.InnerHtml);
        }
    }
    return output.ToString();
}
UPDATE
@SergeBelov provided a solution that works for the first sample input; however, further testing revealed a problem with input like the one below.
Sample input #2:
some more text some more text
<div>
    <p>some text</p>
    <p>some more text some more text some more text some more text some more text</p>
</div>
Given int maxLength = 7, the output should be equal to some mo.
It does not work that way because of this code, where ParentNode is null:
lastNode
    .Node
    .ParentNode
    .ReplaceChild(
        HtmlNode.CreateNode(lastNodeText.InnerText.Substring(0, lastNode.NodeLength - lastNode.TotalLength + maxLength)),
        lastNode.Node);
Creating a new HtmlNode does not seem to help because its InnerText property is read-only.
The small console program below illustrates one possible approach, which is:
UPDATE: This should still work when the first node is a text node; a Trim() is probably required to remove the whitespace from it, as below.
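For readers who want something concrete to experiment with, here is a minimal sketch of one way to do the truncation with HtmlAgilityPack. It is not the console program referred to above. The names HtmlTruncator, TruncateHtml and TruncateChildren are made up for illustration. The sketch assumes that whitespace-only text nodes (the line breaks between tags) should not count toward the limit, and that characters are counted on the raw text, so an entity such as &amp; counts as five characters and could be cut in the middle.

```csharp
using System.Linq;
using HtmlAgilityPack; // NuGet package, the same parser used in the question

// Hypothetical helper, for illustration only.
public static class HtmlTruncator
{
    public static string TruncateHtml(string html, int maxLength)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        int remaining = maxLength;
        TruncateChildren(doc.DocumentNode, ref remaining);

        // Serialize the modified tree; closing tags are written for kept elements.
        return doc.DocumentNode.WriteTo();
    }

    // Walks the children of 'parent' in document order, spending 'remaining'
    // on text nodes and removing everything once the budget is used up.
    private static void TruncateChildren(HtmlNode parent, ref int remaining)
    {
        // Iterate over a copy because children may be removed during the loop.
        foreach (HtmlNode child in parent.ChildNodes.ToList())
        {
            if (remaining <= 0)
            {
                // Budget exhausted: drop this node and its whole subtree.
                parent.RemoveChild(child);
            }
            else if (child is HtmlTextNode textNode)
            {
                // Assumption: whitespace between tags is kept but not counted.
                if (string.IsNullOrWhiteSpace(textNode.Text))
                    continue;

                // Shorten the text in place if it exceeds the budget.
                if (textNode.Text.Length > remaining)
                    textNode.Text = textNode.Text.Substring(0, remaining);

                remaining -= textNode.Text.Length;
            }
            else
            {
                // Element (or comment) node: descend into its children.
                TruncateChildren(child, ref remaining);
            }
        }
    }
}
```

Because the text is shortened through the writable Text property of HtmlTextNode and nodes are removed through the parent that is already in hand, the sketch does not rely on ParentNode or ReplaceChild, so a text node at the top level of the input is handled the same way as one inside a p element. If leading whitespace in such a node should not be counted, it could be trimmed before the length check.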