I know there is String#length and the various methods in Character which more or

Asked: May 24, 2026

I know there is String#length and the various methods in Character which more or less work on code units/code points.

What is the suggested way in Java to get the length of a string as defined by the Unicode standard (UAX #29), taking things like language/locale, normalization and grapheme clusters into account?

1 Answer

Editorial Team · Answer 1 · 2026-05-24T03:16:59+00:00

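java.text.BreakIterator can iterate over text and report “character”, word, sentence and line boundaries. Its “character” boundaries are those of user-perceived characters (grapheme clusters), not of code units or code points.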

Consider this code:

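import java.text.BreakIterator
import java.util.Locale

// Counts user-perceived characters by walking the character boundaries in `text`.
def length(text: String, locale: Locale = Locale.ENGLISH): Int = {
  val charIterator = BreakIterator.getCharacterInstance(locale)
  charIterator.setText(text)

  var result = 0
  // next() returns the offset of the following boundary, or DONE at the end of the text
  while (charIterator.next() != BreakIterator.DONE) result += 1
  result
}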

Running it:

scala> val text = "Thîs lóo̰ks we̐ird!"
text: java.lang.String = Thîs lóo̰ks we̐ird!

scala> val graphemes = length(text)
graphemes: Int = 17

scala> val codepoints = text.codePointCount(0, text.length)
codepoints: Int = 21

With surrogate pairs:

scala> val parens = "\uDBFF\uDFFCsurpi\u0301se!\uDBFF\uDFFD"
parens: java.lang.String = surpíse!

scala> val graphemes = length(parens)
graphemes: Int = 10

scala> val codepoints = parens.codePointCount(0, parens.length)
codepoints: Int = 11

scala> val codeunits = parens.length
codeunits: Int = 13

This should do the job in most cases, though the exact boundaries depend on the Unicode rules implemented by the JDK version in use.
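Since the question asks about Java and also mentions normalization, here is a rough Java sketch of the same idea. The class name GraphemeCount and the helper codePoints are made up for illustration; only BreakIterator, Normalizer and String#codePointCount come from the JDK. It also normalizes the input to NFC with java.text.Normalizer, which changes the code point count but should leave the perceived-character count alone.

```java
import java.text.BreakIterator;
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper class, not part of the JDK.
final class GraphemeCount {

    // Counts user-perceived characters using BreakIterator's character instance.
    static int count(String text, Locale locale) {
        BreakIterator it = BreakIterator.getCharacterInstance(locale);
        it.setText(text);
        int n = 0;
        // next() returns the next boundary offset, or DONE once the end is reached
        while (it.next() != BreakIterator.DONE) {
            n++;
        }
        return n;
    }

    // Convenience overload; Locale.ENGLISH mirrors the default used in the Scala version.
    static int count(String text) {
        return count(text, Locale.ENGLISH);
    }

    // Number of Unicode code points (surrogate pairs count once).
    static int codePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        String decomposed = "surpi\u0301se!"; // 'i' followed by COMBINING ACUTE ACCENT
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);

        System.out.println("code points (as written): " + codePoints(decomposed));
        System.out.println("code points (NFC):        " + codePoints(composed));
        System.out.println("perceived characters:     " + count(decomposed));
        System.out.println("perceived characters NFC: " + count(composed));
    }
}
```

If what you need is a stable code point count rather than a perceived-character count, normalizing first is the part that matters. For perceived characters, the BreakIterator loop alone is usually enough.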
