Does anybody have an example of splitting an HTML string (coming from a TinyMCE editor) into N parts using C#?
I need to split the string evenly without splitting words.
I was thinking of just splitting the HTML and using the HtmlAgilityPack to try and fix the broken tags. I’m not sure how to find the split point, though, as ideally it should be based purely on the text rather than on the HTML as well.
Anybody got any ideas on how to go about this?
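For the "split evenly without splitting words" part on its own, here is a minimal sketch that works on the plain text only (tags already stripped). The class and method names (TextSplitter, SplitWords) are made up for illustration, and "evenly" is assumed to mean an even word count, with any leftover words going to the earlier parts:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TextSplitter
{
    // Hypothetical helper: splits plain text into `parts` groups of whole words,
    // balanced by word count. Leftover words go to the earlier groups.
    public static List<string> SplitWords(string text, int parts)
    {
        if (parts < 1) throw new ArgumentOutOfRangeException(nameof(parts));

        // A null separator array means "split on any whitespace".
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        int baseSize = words.Length / parts; // minimum words per part
        int extra = words.Length % parts;    // first `extra` parts get one more word

        var result = new List<string>(parts);
        int start = 0;
        for (int i = 0; i < parts; i++)
        {
            int size = baseSize + (i < extra ? 1 : 0);
            result.Add(string.Join(" ", words.Skip(start).Take(size)));
            start += size;
        }
        return result;
    }
}
```

For an eight-word sentence split into 3 parts this gives groups of 3, 3 and 2 words. Balancing by character length instead of word count would need a different size calculation.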
UPDATE
As requested, here is an example of input and desired output.
INPUT:
<p><strong>Lorem ipsum dolor sit amet, <em>consectetur adipiscing</em></strong> elit.</p>
OUTPUT (When split into 3 cols):
Part1: <p><strong>Lorem ipsum dolor</strong></p>
Part2: <p><strong>sit amet, <em>consectetur</em></strong></p>
Part3: <p><strong><em>adipiscing</em></strong> elit.</p>
UPDATE 2:
I’ve just had a play with Tidy HTML, and it seems to work well at fixing broken tags, so this may be a good option if I can find a way to locate the split points.
UPDATE 3
Using a method similar to the one in “Truncate string on whole words in .NET C#”, I’ve now managed to get a list of plain text words that will make up each part. So, say that using Tidy HTML I have a valid XML structure for the HTML, and given this list of words, has anybody got an idea of what would now be the best way to split it?
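One possible way to go from "valid XML plus a word count per part" to the parts themselves is to copy the tree once per part and keep only the words whose position falls inside that part's range. The sketch below uses System.Xml.Linq rather than Tidy HTML or the HtmlAgilityPack, and assumes the markup is already well-formed XML. XhtmlSplitter, Split and CopyRange are names invented for this example:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class XhtmlSplitter
{
    // A "word" token is a run of non-whitespace plus any whitespace around it,
    // so the spacing between kept words survives unchanged.
    private static readonly Regex Token = new Regex(@"\s*\S+\s*");

    // Returns one pruned copy of `root` per entry in `wordsPerPart`.
    public static List<XElement> Split(XElement root, IList<int> wordsPerPart)
    {
        var result = new List<XElement>();
        int from = 0;
        foreach (int size in wordsPerPart)
        {
            int counter = 0; // index of the next word in document order
            result.Add(CopyRange(root, from, from + size, ref counter));
            from += size;
        }
        return result;
    }

    // Copies `source`, keeping only words with index in [from, to).
    // Elements left with no content are dropped (null is returned).
    private static XElement CopyRange(XElement source, int from, int to, ref int counter)
    {
        var copy = new XElement(source.Name, source.Attributes());
        foreach (XNode node in source.Nodes())
        {
            if (node is XText text)
            {
                var kept = new List<string>();
                foreach (Match m in Token.Matches(text.Value))
                {
                    if (counter >= from && counter < to) kept.Add(m.Value);
                    counter++;
                }
                if (kept.Count > 0) copy.Add(new XText(string.Concat(kept)));
            }
            else if (node is XElement child)
            {
                XElement childCopy = CopyRange(child, from, to, ref counter);
                if (childCopy != null) copy.Add(childCopy);
            }
        }
        return copy.Nodes().Any() ? copy : null;
    }
}
```

Called as XhtmlSplitter.Split(XElement.Parse(html, LoadOptions.PreserveWhitespace), new[] { 3, 3, 2 }) on the sample input, the third part serializes (with SaveOptions.DisableFormatting) as <p><strong><em>adipiscing</em></strong> elit.</p>. Because whitespace stays attached to the word before it, earlier parts can end with a trailing space inside the last tag. Comments and other non-text nodes are simply dropped in this sketch, and a part that receives no words comes back as null.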
UPDATE 4
Can anybody see an issue with using a regex to find the indices within the HTML in the following way:
Given the plain text string “sit amet, consectetur”, replace all spaces with the regex “(\s|<(.|\n)+?>)*”, which should in theory find that string with any combination of spaces and/or tags between the words.
I could then just use Tidy HTML to fix the broken HTML tags?
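A rough sketch of that regex idea, with two tweaks that are my own assumptions rather than part of the question: each word is passed through Regex.Escape so punctuation cannot be read as regex syntax, and the gap pattern uses + instead of * so that two words cannot match when nothing separates them. PhraseLocator and Find are illustrative names:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PhraseLocator
{
    // One or more whitespace characters or tags between words.
    // [^>] also matches newlines, so (.|\n) is not needed.
    private const string Gap = @"(?:\s|<[^>]+>)+";

    // Finds the first place in `html` where the words of `phrase` occur in order,
    // separated by any mix of whitespace and tags.
    public static Match Find(string html, string phrase)
    {
        var words = phrase
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Select(Regex.Escape);
        string pattern = string.Join(Gap, words);
        return Regex.Match(html, pattern);
    }
}
```

On the sample input, PhraseLocator.Find(html, "sit amet, consectetur") matches the span sit amet, <em>consectetur, and Match.Index and Match.Length then give the cut points. Things this does not handle: tags that fall in the middle of a word, HTML entities such as &amp;nbsp; that look different in the plain text, and a > character inside an attribute value.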
Many thanks
Matt
A Proposed Solution
Man, this is a curse of mine! I apparently cannot walk away from a problem without spending up-to-and-including an unreasonable amount of time on it.
I thought about this. I thought about HTML Tidy, and maybe it would work, but I had trouble wrapping my head around it.
So, I wrote my own solution.
I tested this on your input and on some other input that I threw together myself. It seems to work pretty well. Surely there are holes in it, but it might provide you with a starting point.
Anyway, my approach was this:
HtmlWord class below.
HtmlLine class below.
HtmlAgilityPack.HtmlNode object. These I have implemented in the HtmlHelper class below.
Am I crazy for doing all this? Probably, yes. But, you know, if you can’t figure out any other way, you can give this a try.
Here’s how it works with your sample input:
Output:
And now for the code:
HtmlWord class
HtmlLine class
HtmlHelper static class
Conclusion
Just to reiterate: this is a thrown-together solution; I’m sure it has problems. I present it only as a starting point for you to consider — again, if you’re unable to get the behavior you want through other means.