Does anyone know if the standard Java library (any version) provides a means of


Asked: May 27, 2026


Does anyone know if the standard Java library (any version) provides a means of calculating the length of the binary encoding of a string (specifically UTF-8 in this case) without actually generating the encoded output? In other words, I’m looking for an efficient equivalent of this:

"some really long string".getBytes("UTF-8").length

I need to calculate a length prefix for potentially long serialized messages.


1 Answer

Editorial Team · Answer 1 · 2026-05-27T12:40:01+00:00

Here’s an implementation based on the UTF-8 specification:

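public class Utf8LenCounter {
  /**
   * Returns the number of bytes needed to encode the sequence as UTF-8,
   * assuming it contains no unpaired surrogates.
   */
  public static int length(CharSequence sequence) {
    int count = 0;
    for (int i = 0, len = sequence.length(); i < len; i++) {
      char ch = sequence.charAt(i);
      if (ch <= 0x7F) {
        count++;        // U+0000..U+007F: 1 byte
      } else if (ch <= 0x7FF) {
        count += 2;     // U+0080..U+07FF: 2 bytes
      } else if (Character.isHighSurrogate(ch)) {
        count += 4;     // surrogate pair (U+10000..U+10FFFF): 4 bytes
        ++i;            // skip the low surrogate of the pair
      } else {
        count += 3;     // rest of the BMP (U+0800..U+FFFF): 3 bytes
      }
    }
    return count;
  }
}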

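This implementation is not tolerant of malformed strings: it assumes every high surrogate is followed by a low surrogate. An unpaired high surrogate is counted as 4 bytes and the char after it is skipped, and an unpaired low surrogate is counted as 3 bytes, whereas getBytes replaces each unpaired surrogate with a single '?' byte.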
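If unpaired surrogates can occur in the input, one possible variant checks for a complete pair before counting 4 bytes. This is a sketch rather than part of the original answer: the class name TolerantUtf8Len is hypothetical, it relies on Character.isSurrogate (Java 7 or later), and it assumes the goal is to match String.getBytes, which substitutes a single '?' byte for each unpaired surrogate.

```java
public class TolerantUtf8Len {
  /**
   * Counts UTF-8 bytes, treating each unpaired surrogate as one
   * replacement byte, the way String.getBytes does for UTF-8.
   */
  public static int length(CharSequence sequence) {
    int count = 0;
    for (int i = 0, len = sequence.length(); i < len; i++) {
      char ch = sequence.charAt(i);
      if (ch <= 0x7F) {
        count++;
      } else if (ch <= 0x7FF) {
        count += 2;
      } else if (Character.isHighSurrogate(ch)
          && i + 1 < len
          && Character.isLowSurrogate(sequence.charAt(i + 1))) {
        count += 4;  // well-formed surrogate pair
        i++;         // consume the low surrogate as well
      } else if (Character.isSurrogate(ch)) {
        count++;     // unpaired surrogate: assumed to become a single '?'
      } else {
        count += 3;
      }
    }
    return count;
  }
}
```

The extra bounds check and second charAt call only run for high surrogates, so the cost for ordinary BMP text should be the same as before.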

Here’s a JUnit 4 test for verification:

import java.nio.charset.Charset;

import org.junit.Assert;
import org.junit.Test;

public class LenCounterTest {
  @Test public void testUtf8Len() {
    Charset utf8 = Charset.forName("UTF-8");
    AllCodepointsIterator iterator = new AllCodepointsIterator();
    while (iterator.hasNext()) {
      String test = new String(Character.toChars(iterator.next()));
      Assert.assertEquals(test.getBytes(utf8).length,
                          Utf8LenCounter.length(test));
    }
  }

  private static class AllCodepointsIterator {
    private static final int MAX = 0x10FFFF; //see http://unicode.org/glossary/
    private static final int SURROGATE_FIRST = 0xD800;
    private static final int SURROGATE_LAST = 0xDFFF;
    private int codepoint = 0;
    public boolean hasNext() { return codepoint < MAX; }
    public int next() {
      int ret = codepoint;
      codepoint = next(codepoint);
      return ret;
    }
    private int next(int codepoint) {
      while (codepoint++ < MAX) {
        if (codepoint == SURROGATE_FIRST) { codepoint = SURROGATE_LAST + 1; }
        if (!Character.isDefined(codepoint)) { continue; }
        return codepoint;
      }
      return MAX;
    }
  }
}

Please excuse the compact formatting.
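For the length-prefix use case raised in the question, a count like this might be used roughly as follows. This is only a sketch: the class LengthPrefixedWriter, its write method, and the 4-byte big-endian prefix are assumptions made for illustration, not something the question specifies. The helper walks code points instead of chars and assumes well-formed input with no unpaired surrogates.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class LengthPrefixedWriter {
  /** UTF-8 byte count per code point; assumes no unpaired surrogates. */
  static int utf8Length(String s) {
    int count = 0;
    for (int i = 0; i < s.length(); ) {
      int cp = s.codePointAt(i);
      count += cp <= 0x7F ? 1 : cp <= 0x7FF ? 2 : cp <= 0xFFFF ? 3 : 4;
      i += Character.charCount(cp);
    }
    return count;
  }

  /** Writes a 4-byte big-endian length (assumed format), then the UTF-8 payload. */
  public static void write(OutputStream out, String message) throws IOException {
    DataOutputStream data = new DataOutputStream(out);
    data.writeInt(utf8Length(message));
    // Encode through the writer instead of calling getBytes up front.
    Writer writer = new OutputStreamWriter(data, StandardCharsets.UTF_8);
    writer.write(message);
    writer.flush();
  }
}
```

The prefix is written before any encoding happens, which is the point of computing the length separately. If the input may be malformed, the count and the bytes actually written can disagree, so the prefix would need a counter that mirrors the encoder's replacement behaviour.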
