I’m writing a website with some articles for a Chinese audience. On the page that lists the articles, I would like to show each title and a short excerpt of the article. However, the articles are encoded in a mixture of different Big5 encodings. Don’t ask me why – that’s what I got – so I can’t guarantee the number of bytes each character takes to encode.
How can I trim the string down to a short excerpt without cutting a multi-byte character in half?
If you’re certain that you won’t have any characters outside of the BMP (Basic Multilingual Plane), then you can convert the text to UCS-2, where every character takes exactly two bytes, and then slice on an even byte boundary.
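As a rough illustration of that idea, here is a minimal Python sketch. The function name `excerpt_ucs2` and the choice of the `big5hkscs` codec are my own assumptions, not something from your setup; substitute whichever Big5 variant each article really uses. Python has no codec named UCS-2, so the sketch uses `utf-16-le`, which produces the same bytes as UCS-2 as long as every character is inside the BMP.

```python
def excerpt_ucs2(raw: bytes, max_chars: int, encoding: str = "big5hkscs") -> str:
    """Return at most max_chars characters from Big5-encoded bytes.

    Hypothetical helper: the name and the default codec are assumptions.
    """
    # Decode first, so we stop thinking in variable-width Big5 bytes.
    text = raw.decode(encoding, errors="replace")

    # The fixed-width trick only holds inside the BMP; outside it,
    # UTF-16 needs surrogate pairs (4 bytes) and the slice could split one.
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("text contains characters outside the BMP")

    # For BMP-only text, UTF-16-LE is byte-for-byte the same as UCS-2:
    # exactly two bytes per character.
    ucs2 = text.encode("utf-16-le")

    # Slice on an even byte boundary, then decode back to a string.
    return ucs2[: 2 * max_chars].decode("utf-16-le")


if __name__ == "__main__":
    sample = "中文字元測試".encode("big5hkscs")
    print(excerpt_ucs2(sample, 3))
```

If your language has a real Unicode string type (as Python 3 does), slicing the decoded string directly, e.g. `text[:max_chars]`, should give the same result without the round trip through UCS-2. The byte-level version is mainly useful where strings are plain byte arrays.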