In Java I have an arbitrary HTML document as a string. For simplicity, say:
String original = "Hello, <strong>this</strong> is a string";
I also have a record of various locations in the string, always within the text, never within a tag. For example, the start and end indices of the word “is” are 29 and 31.
I then perform a transformation on the string – in this case stripping out the HTML tags. This leaves:
original = "Hello, this is a string";
Is there an elegant way of getting the new start and end index of the word “is” now (12 and 14)?
The one solution I can think of is to insert a “flag” at each original index, strip the HTML, then remove the flags while recording their locations. This shouldn’t cause any issues with the HTML stripping, as the indices always occur outside the tags.
If this is actually the best way, does anyone have any recommendations for a good choice of “flag” that definitely won’t coincidentally occur in any HTML documents?
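For concreteness, the flag idea could be sketched roughly as below. This is only an illustration: the class name FlagMapper, the choice of the private-use character U+E000 as the flag, and the regex used as a stand-in for the real tag stripper are all assumptions. Nothing guarantees that a given character never appears in a document, so the sketch checks for it up front instead of relying on it.

```java
class FlagMapper {
    // Hypothetical flag: a Unicode private-use character. It is unlikely to
    // appear in HTML, but that is not guaranteed, so the input is checked first.
    static final char FLAG = '\uE000';

    /**
     * Maps indices in the original string to indices in the stripped string.
     * Assumes the indices are sorted ascending and never fall inside a tag.
     */
    static int[] mapIndices(String html, int[] sortedIndices) {
        if (html.indexOf(FLAG) >= 0) {
            throw new IllegalArgumentException("Flag character already occurs in the input");
        }

        // 1. Insert a flag at each recorded index.
        StringBuilder flagged = new StringBuilder();
        int prev = 0;
        for (int idx : sortedIndices) {
            flagged.append(html, prev, idx).append(FLAG);
            prev = idx;
        }
        flagged.append(html, prev, html.length());

        // 2. Strip the tags. This regex is only a stand-in for the real stripper.
        String stripped = flagged.toString().replaceAll("<[^>]*>", "");

        // 3. Find each flag; its new index is its position minus the number
        //    of flags seen before it.
        int[] result = new int[sortedIndices.length];
        int found = 0;
        for (int i = 0; i < stripped.length(); i++) {
            if (stripped.charAt(i) == FLAG) {
                result[found] = i - found;
                found++;
            }
        }
        return result;
    }
}
```

For the example string, mapIndices(original, new int[] {29, 31}) yields 12 and 14.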
The best approach depends on how you’re stripping the HTML tags. If you’re simply removing everything enclosed in <> brackets, you can loop through the old string up to the old index and count the characters that fall outside <> brackets; that count is the new index. Something along these lines would probably work:
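A minimal sketch of that counting loop, assuming a tag is everything from a < up to the next > and that the old index never falls inside a tag. The class and method names (IndexMapper, mapIndex) are made up for illustration.

```java
class IndexMapper {
    /**
     * Returns the index in the tag-stripped string that corresponds to
     * oldIndex in the original string, by counting the characters before
     * oldIndex that lie outside <...> brackets.
     */
    static int mapIndex(String html, int oldIndex) {
        int newIndex = 0;
        boolean insideTag = false;
        int limit = Math.min(oldIndex, html.length());
        for (int i = 0; i < limit; i++) {
            char c = html.charAt(i);
            if (c == '<') {
                insideTag = true;       // a tag starts; stop counting
            } else if (c == '>' && insideTag) {
                insideTag = false;      // the tag ends; resume counting
            } else if (!insideTag) {
                newIndex++;             // ordinary text survives the stripping
            }
        }
        return newIndex;
    }
}
```

With the example string, mapIndex(original, 29) gives 12 and mapIndex(original, 31) gives 14. If the real stripper also decodes entities or collapses whitespace, the loop would need to mirror those rules as well.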