I’m going to be working on software (in c#) that needs to read/write Unicode

Question


Asked: June 15, 2026 (2026-06-15T13:22:39+00:00)

I’m going to be working on software (in C#) that needs to read/write Unicode strings (specifically English, German, Spanish and Arabic) to a hardware device. The firmware developer tells me that his code expects each string to be stored as a fixed-length byte array in one binary file, so he can quickly access any string using an index (index * length = starting offset, then read the fixed-length number of bytes). I understand that .NET internally uses UTF-16, which I believe is technically a variable-length encoding (the length depends on the value of the Unicode code point). I’m fairly certain that English, German and Spanish would all use two bytes per character when encoded using UTF-16, but I’m not so sure about Arabic. It looks like there might be some Arabic characters that could require three bytes each in UTF-16, and that would seem to break the firmware developer’s plan to store the strings as a fixed length.

First, can anyone confirm my understanding of the variable-length nature of UTF-8/UTF-16 encodings? And second, although it would waste a lot of space, is UTF-32 (fixed-size, each character represented using 4 bytes) the best option for ensuring that each string could be stored as a fixed length? Thanks!


1 Answer

Editorial Team · Answer 1 · 2026-06-15T13:22:40+00:00

Unicode terminology:

Each entry in the Unicode character set is a code point.
Encoded code points consist of one or more code units in a transformation format (UTF-8 uses 8-bit code units; UTF-16 uses 16-bit code units; UTF-32 uses 32-bit code units).
The user-visible grapheme might consist of a sequence of code points.

So:

A code point in UTF-8 is 1, 2, 3 or 4 octets wide
A code point in UTF-16 is 2 or 4 octets wide
A code point in UTF-32 is 4 octets wide
The number of graphemes rendered on the screen might be fewer than the number of code points
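The widths listed above can be checked with a short sketch. It is written in Python for brevity; the sample characters are my own picks rather than anything from the question, and .NET's `Encoding.UTF8`, `Encoding.Unicode` and `Encoding.UTF32` should report the same byte counts. Note that UTF-16 never produces three bytes: a code point takes either one 16-bit code unit or two. The basic Arabic block (U+0600 to U+06FF) lies in the Basic Multilingual Plane, so those letters take two bytes each in UTF-16.

```python
# Sample code points (illustrative choices, not from the original question).
samples = {
    "A": "LATIN CAPITAL LETTER A (U+0041)",
    "\u00df": "LATIN SMALL LETTER SHARP S (U+00DF)",
    "\u0628": "ARABIC LETTER BEH (U+0628)",
    "\u20ac": "EURO SIGN (U+20AC)",
    "\U0001F600": "GRINNING FACE (U+1F600), outside the BMP",
}


def widths(ch):
    """Return the encoded size in bytes as (UTF-8, UTF-16, UTF-32)."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),  # the -le variant writes no byte-order mark
        len(ch.encode("utf-32-le")),
    )


for ch, name in samples.items():
    print(name, widths(ch))

# One user-visible grapheme built from two code points:
# 'e' followed by U+0301 COMBINING ACUTE ACCENT.
e_acute = "e\u0301"
print("code points in e + combining acute:", len(e_acute))
```

Only the last sample needs four bytes in UTF-16; the rest, including the Arabic letter, need two.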

So, if you want to support the entire Unicode range, you need to make the fixed-length strings a multiple of 32 bits regardless of which of these UTFs you choose as the encoding, because a single code point can occupy up to 4 octets in any of them. (I’m assuming unused bytes will be set to 0x00, appended as padding on write and trimmed on read.)
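A minimal sketch of that padding scheme, assuming UTF-16LE and a 64-byte record. Both are placeholders, since the question does not say which encoding or record size the firmware expects, and the helper names `pack`, `unpack` and `read_string` are hypothetical. A string that is too long is rejected rather than truncated, because cutting at a byte boundary could split a code point.

```python
import io

RECORD_SIZE = 64        # bytes per slot; hypothetical value, a multiple of 4
ENCODING = "utf-16-le"  # assumption: the firmware's encoding is not stated


def pack(text, size=RECORD_SIZE, encoding=ENCODING):
    """Encode text and pad it with 0x00 bytes to exactly `size` bytes."""
    data = text.encode(encoding)
    if len(data) > size:
        raise ValueError(f"{len(data)} bytes do not fit in a {size}-byte record")
    return data.ljust(size, b"\x00")


def unpack(record, encoding=ENCODING):
    """Decode one record and drop the trailing NUL padding."""
    return record.decode(encoding).rstrip("\x00")


def read_string(f, index, size=RECORD_SIZE, encoding=ENCODING):
    """Seek to index * size and read one fixed-length record."""
    f.seek(index * size)
    return unpack(f.read(size), encoding)


# English, German, Spanish and Arabic samples (illustrative only).
strings = ["Hello", "Gr\u00fc\u00dfe", "Se\u00f1or", "\u0645\u0631\u062d\u0628\u0627"]
f = io.BytesIO(b"".join(pack(s) for s in strings))
print(read_string(f, 3) == strings[3])
```

One consequence of trimming on read is that a string cannot legitimately end in U+0000; if that matters, a length prefix inside each record would be an alternative.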

In terms of communicating length restrictions via a user interface, you’ll probably want to decide on some compromise based on a code unit size and the typical customer, rather than trying to find the width of the most complicated grapheme you can build.
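As one possible form of that compromise, the limit could be stated in UTF-16 code units, which is also what `string.Length` counts in .NET. The sketch below assumes a hypothetical limit of 32 code units (a 64-byte record); `utf16_code_units` and `fits` are illustrative names.

```python
MAX_UNITS = 32  # hypothetical limit shown to the user (64 bytes / 2)


def utf16_code_units(text):
    """Number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def fits(text, max_units=MAX_UNITS):
    """True if text fits within the code-unit budget of one record."""
    return utf16_code_units(text) <= max_units


print(fits("Se\u00f1or"))                   # 5 code units
print(utf16_code_units("\U0001F600"))       # a non-BMP code point costs 2
```

A user typing only BMP characters sees the limit as "32 characters", while characters outside the BMP and combining sequences use up the budget faster.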
