I want to truncate some text (loaded from a database or text file), but it contains HTML, so the tags are counted as well and less visible text is returned. This can also leave tags unclosed or only partially closed (so Tidy may not work properly, and there is still less content). How can I truncate based on the text alone (probably stopping when I reach a table, as that could cause more complex issues)?
substr("Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.", 0, 26) . "..."
Would result in:
Hello, my <strong>name</st...
What I would want is:
Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m...
How can I do this?
While my question is about how to do it in PHP, it would be good to know how to do it in C# as well. Either should be OK, as I think I would be able to port the method over (unless it is a built-in method).
Also note that I have included an HTML entity, &acute;, which would have to be counted as a single character (rather than 7 characters, as in this example).
strip_tags is a fallback, but I would lose formatting and links and it would still have the problem with HTML entities.
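For reference, the fallback mentioned above could look roughly like the sketch below. It assumes the input is UTF-8 and that the mbstring extension is available; the function name truncatePlainText is made up for illustration. Decoding entities before cutting makes &acute; count as one character, but all formatting and links are lost, as noted.

```php
<?php
// Hypothetical fallback: drop all markup, then cut by characters.
// Assumes UTF-8 input and the mbstring extension.
function truncatePlainText(string $html, int $maxLength): string
{
    // strip_tags() removes the tags; html_entity_decode() turns entities
    // such as &acute; into single characters so they are counted once.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // mb_substr() counts characters, not bytes.
    return mb_substr($text, 0, $maxLength, 'UTF-8');
}

$html = 'Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.';
echo truncatePlainText($html, 26) . "...\n";
```

The result is plain text, so it should go through htmlspecialchars() again before being written into a page.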
Assuming you are using valid XHTML, it’s simple to parse the HTML and make sure tags are handled properly. You simply need to track which tags have been opened so far, and make sure to close them again “on your way out”.
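A minimal sketch of that approach follows. It is an illustration written for this description rather than a definitive implementation: the name truncateHtml and its parameters are hypothetical, it returns a string instead of printing, and it assumes well-nested XHTML without comments, CDATA sections, or ">" inside attribute values. It does not stop at tables.

```php
<?php
// Hypothetical helper: keep at most $maxLength visible characters of $html,
// pass tags through untouched, and close any tag that is still open.
function truncateHtml(string $html, int $maxLength, bool $isUtf8 = true): string
{
    $out = '';
    $printed = 0;   // visible characters emitted so far
    $pos = 0;       // byte offset into $html
    $open = [];     // stack of currently open tag names

    // Match a tag, a character entity, or (for UTF-8) one multibyte sequence.
    $re = $isUtf8
        ? '{</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>|&#?[a-zA-Z0-9]+;|[\xC0-\xFF][\x80-\xBF]*}'
        : '{</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>|&#?[a-zA-Z0-9]+;}';

    while ($printed < $maxLength
        && preg_match($re, $html, $m, PREG_OFFSET_CAPTURE, $pos)) {
        [$token, $tokenPos] = $m[0];

        // Plain single-byte text in front of the token.
        $text = substr($html, $pos, $tokenPos - $pos);
        if ($printed + strlen($text) >= $maxLength) {
            $out .= substr($text, 0, $maxLength - $printed);
            $printed = $maxLength;
            break;
        }
        $out .= $text;
        $printed += strlen($text);

        if ($token[0] !== '<') {
            // Entity or multibyte character: counts as one character.
            $out .= $token;
            $printed++;
        } elseif ($token[1] === '/') {
            // Closing tag (proper nesting is assumed, not verified).
            array_pop($open);
            $out .= $token;
        } else {
            $out .= $token;
            if (substr($token, -2) !== '/>') {
                $open[] = $m[1][0];   // opening tag: remember to close it
            }
        }
        $pos = $tokenPos + strlen($token);
    }

    // No more tokens: copy whatever plain text still fits.
    if ($printed < $maxLength && $pos < strlen($html)) {
        $out .= substr($html, $pos, $maxLength - $printed);
    }

    // Close the tags that are still open, innermost first.
    while ($open) {
        $out .= '</' . array_pop($open) . '>';
    }
    return $out;
}

$html = 'Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.';
echo truncateHtml($html, 26) . "...\n";
```

With the example from the question, the 26 counted characters end after the "m" that follows the entity, and a shorter limit such as 12 cuts inside the strong element, which is then closed automatically.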
Encoding note: The above code assumes the XHTML is UTF-8 encoded. ASCII-compatible single-byte encodings (such as Latin-1) are also supported; just pass false as the third argument. Other multibyte encodings are not supported, though you may hack in support by using mb_convert_encoding to convert to UTF-8 before calling the function, then converting back again in every print statement. (You should always be using UTF-8, though.)
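The conversion workaround can be sketched as a small wrapper. This is only an illustration under assumptions: viaUtf8 is a made-up name, the truncation routine is passed in as a callable that takes and returns a UTF-8 string, and the mbstring extension must be available.

```php
<?php
// Hypothetical wrapper: run a UTF-8-only routine on text in another encoding.
function viaUtf8(string $html, string $encoding, callable $truncate): string
{
    // Convert to UTF-8 before calling the routine...
    $utf8 = mb_convert_encoding($html, 'UTF-8', $encoding);
    $result = $truncate($utf8);

    // ...and convert the result back to the original encoding.
    return mb_convert_encoding($result, $encoding, 'UTF-8');
}

// Example with UTF-16LE input and a stand-in routine that keeps two characters.
$utf16 = mb_convert_encoding("h\u{00E9}llo", 'UTF-16LE', 'UTF-8');
$short = viaUtf8($utf16, 'UTF-16LE', function (string $s): string {
    return mb_substr($s, 0, 2, 'UTF-8');
});
echo bin2hex($short) . "\n";
```

Converting once around the whole call, instead of inside every print statement, only works when the routine returns its output as a string.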
Edit: Updated to handle character entities and UTF-8. Fixed a bug where the function would print one character too many if that character was a character entity.