If I have a byte array that contains UTF8 content, how would I go about parsing it? Are there delimiter bytes that I can split off to get each character?
Take a look here…
http://en.wikipedia.org/wiki/UTF-8
If you’re looking to identify the boundary between characters, what you need is in the table in “Description”.
Only the ASCII subset (codepoints 0..127) has a zero high bit, and each of those is encoded in a single byte. Every non-ASCII codepoint is encoded as two or more bytes, and every byte from the second onwards has “10” in its two highest bits. The leading byte of a codepoint never has that pattern: its high bits indicate the number of bytes in the sequence. There’s some redundancy here, so you could equally scan forward to the next byte that doesn’t start with “10” to find the start of the next codepoint.
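As a rough illustration of both approaches, here is a minimal Python sketch. The names `sequence_length` and `split_utf8` are made up for this example, and the code assumes the input is already valid UTF-8 (it does not check for truncated or overlong sequences).

```python
from typing import List


def sequence_length(lead: int) -> int:
    """Return how many bytes a sequence has, judging by its leading byte."""
    if lead < 0x80:            # 0xxxxxxx: ASCII, single byte
        return 1
    if lead >> 5 == 0b110:     # 110xxxxx: two-byte sequence
        return 2
    if lead >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        return 3
    if lead >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        return 4
    # 10xxxxxx is a continuation byte; anything else is not valid here.
    raise ValueError("not a valid UTF-8 leading byte: 0x%02x" % lead)


def split_utf8(data: bytes) -> List[bytes]:
    """Split UTF-8 bytes into one chunk per codepoint.

    Uses the alternative approach: a new codepoint starts at every
    byte whose top two bits are not "10".
    """
    chunks = []
    start = 0
    for i in range(1, len(data)):
        if (data[i] & 0xC0) != 0x80:   # not a continuation byte
            chunks.append(data[start:i])
            start = i
    if data:
        chunks.append(data[start:])
    return chunks
```

For example, `split_utf8("aé€".encode("utf-8"))` yields three chunks of 1, 2 and 3 bytes, and `sequence_length` applied to the first byte of each chunk gives the same lengths. In practice, if you just want the text, `data.decode("utf-8")` does the whole job and validates the input as well.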
A codepoint in Unicode isn’t necessarily the same as a character. There are combining codepoints (such as accents), for instance, which modify the codepoint that precedes them.
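To make the codepoint/character distinction concrete, here is a small Python sketch using the standard `unicodedata` module. The helper `group_combining` is a hypothetical name for this example, and it is a deliberate simplification: it only attaches combining marks to the codepoint before them, which is not the full grapheme cluster segmentation that Unicode defines.

```python
import unicodedata


def group_combining(text: str) -> list:
    """Group each base codepoint with the combining marks that follow it."""
    groups = []
    for ch in text:
        # unicodedata.combining() returns a non-zero class for combining marks.
        if groups and unicodedata.combining(ch):
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups


# "e" followed by U+0301 COMBINING ACUTE ACCENT: two codepoints, one visible character.
accented = "e\u0301"
```

Here `len(accented)` is 2 and its UTF-8 encoding is 3 bytes long, yet `group_combining(accented + "a")` returns two groups, matching what a reader would see as two characters.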