Unicode code units can be of variable size since characters can be represented by


Asked: May 26, 2026 (2026-05-26T02:33:01+00:00)


Unicode code units can be of variable size, since a character can be represented by 2 or more bytes (a sequence of 2-byte units). So if they are stored in binary format, how can a program know how to read them back?

Let’s say ‘a’ is represented by 0F0F 13F3 and ‘b’ is represented by 02AD BC39 09F3 459F.

If I write them in file foo.txt:

0F0F 13F3 02AD BC39 09F3 459F

Then how would I know where to stop for ‘a’ and ‘b’?

To be clear, I am talking about reading and writing pure Unicode, i.e. without converting it into any other format based on a popular charset such as UTF-8.


1 Answer

Editorial Team · Answer 1 · 2026-05-26T02:33:02+00:00


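First, not all Unicode encodings are variable length. UTF-32 and UCS-2 are fixed length: UTF-32 uses 4 bytes for every code point, and UCS-2 uses exactly 2 bytes but can only represent characters in the BMP. UTF-8 and UTF-16 are each, in their own way, variable length.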
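To make the fixed versus variable distinction concrete, here is a minimal Python sketch that counts the bytes each encoding uses for a few sample characters. The helper name `encoded_lengths` and the sample characters are my own choices for illustration; the big-endian codec variants are used only so that no byte order mark is added to the count.

```python
# Sample characters: ASCII, Latin-1 range, BMP, and outside the BMP.
samples = ["a", "é", "€", "😀"]


def encoded_lengths(ch):
    """Return the encoded size in bytes of one character in each encoding.

    The "-be" codecs are used so that Python does not prepend a BOM.
    """
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


for ch in samples:
    print(repr(ch), encoded_lengths(ch))
```

UTF-32 reports 4 bytes for every sample, while the UTF-8 and UTF-16 sizes change depending on the character.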

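Second, if you read the specification, you will learn that the sequences are self-describing. In UTF-8, the byte values that can start a sequence can never appear as its second, third, or fourth byte: the high bits of the lead byte tell you how long the sequence is, and every continuation byte has the form 10xxxxxx. Ditto for the surrogate pairs that represent non-BMP characters in UTF-16: a high surrogate (D800 to DBFF) is always followed by a low surrogate (DC00 to DFFF), and neither range is used for anything else.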
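As a rough illustration of what "self-describing" means in practice, the sketch below finds character boundaries purely by looking at lead bytes (UTF-8) or high surrogates (UTF-16). The function names are hypothetical, and the code assumes well-formed input: it does not check continuation bytes, overlong forms, or unpaired surrogates, which a real decoder must do.

```python
def utf8_sequence_length(lead):
    """Return the length in bytes of the UTF-8 sequence that starts with `lead`."""
    if lead < 0x80:            # 0xxxxxxx: single byte (ASCII)
        return 1
    if 0xC0 <= lead < 0xE0:    # 110xxxxx: two-byte sequence
        return 2
    if 0xE0 <= lead < 0xF0:    # 1110xxxx: three-byte sequence
        return 3
    if 0xF0 <= lead < 0xF8:    # 11110xxx: four-byte sequence
        return 4
    # 10xxxxxx is a continuation byte and can never start a sequence.
    raise ValueError(f"byte 0x{lead:02X} cannot start a UTF-8 sequence")


def split_utf8(data):
    """Split UTF-8 bytes into one bytes object per encoded character."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


def split_utf16_units(units):
    """Split a list of 16-bit code units into one list per character."""
    chunks = []
    i = 0
    while i < len(units):
        if 0xD800 <= units[i] <= 0xDBFF:   # high surrogate: takes the next unit too
            chunks.append(units[i:i + 2])
            i += 2
        else:                              # BMP character: a single unit
            chunks.append(units[i:i + 1])
            i += 1
    return chunks


print(split_utf8("a€😀".encode("utf-8")))
print(split_utf16_units([0x0061, 0xD83D, 0xDE00]))
```

In both cases the reader never needs an external length table: the first unit of each character says how many units follow.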
