Consider the following Ruby code analyzing a three-byte UTF-8 string: #encoding: utf-8 s =

Question

Asked: June 1, 2026 (2026-06-01T09:22:26+00:00)

Consider the following Ruby code analyzing a three-byte UTF-8 string:

#encoding: utf-8
s = "\x65\xCC\x81"
p [s.bytesize, s.length, s, s.encoding.name]
#=> [3, 2, "é", "UTF-8"]

As described on this page of mine, the above really is a two-character string: Latin lowercase e followed by Combining Acute Accent. However, it looks like a single character, and this matters when laying out fixed-width displays.

For example, look at the two entries for “moiré.svg” on this directory listing and notice how one of them has messed up the column alignment.

How can I calculate the ‘monospace visual length’ of a string in Ruby, that is, a length that does not count zero-width combining characters? (One valid technique might be a way to transform a Unicode string into its canonical composed representation, turning the above into "\xC3\xA9", which also looks like é but has a length of 1.)

1 Answer

Editorial Team · Answer 1 · 2026-06-01T09:22:27+00:00

The unicode_utils gem may help.

Current link: https://github.com/lang/unicode_utils
Old link: http://unicode-utils.rubyforge.org/UnicodeUtils.html

There is a char_display_width method:

require "unicode_utils/char_display_width"
UnicodeUtils.char_display_width("別")  # => 2
UnicodeUtils.char_display_width(0x308) # => 0
UnicodeUtils.char_display_width("a")   # => 1

There is a string display_width method:

require "unicode_utils/display_width"
UnicodeUtils.display_width("別れ")      # => 4
UnicodeUtils.display_width("12")        # => 2
UnicodeUtils.display_width("a\u{308}")  # => 1

Also look at each_grapheme.
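
If adding a gem is not an option, a similar grapheme-based count can be sketched with Ruby's own String#grapheme_clusters. This assumes Ruby 2.5 or later, and grapheme_count is a hypothetical helper name, not part of unicode_utils. Note that it counts user-perceived characters rather than terminal columns, so double-width characters such as "別" still count as 1 here, unlike display_width above.

```ruby
# Sketch only: assumes Ruby >= 2.5, where String#grapheme_clusters is built in.
# grapheme_count is a hypothetical helper name used for illustration.
def grapheme_count(str)
  # Each grapheme cluster is a base character plus any combining marks.
  str.grapheme_clusters.length
end

s = "e\u0301" # "e" followed by U+0301 COMBINING ACUTE ACCENT

p s.length                            # code points: 2
p grapheme_count(s)                   # graphemes: 1
p grapheme_count("moire\u0301.svg")   # 9, although the string has 10 code points
```

For column alignment with East Asian text, the width-aware display_width is likely still the better fit.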

(Thanks Michael Anderson for pointing out the additional methods)
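
The normalization idea from the question can also be tried without a gem. The following is a minimal sketch, assuming Ruby 2.2 or later for String#unicode_normalize; nfc_length is a hypothetical helper name. It only shortens sequences that have a precomposed form, so it is not a general answer for zero-width combining characters.

```ruby
# Sketch only: assumes Ruby >= 2.2, where String#unicode_normalize is built in.
# nfc_length is a hypothetical helper name used for illustration.
def nfc_length(str)
  # NFC composes base + combining mark into one code point where one exists.
  str.unicode_normalize(:nfc).length
end

s = "e\u0301" # same bytes as "\x65\xCC\x81"

p s.unicode_normalize(:nfc).bytes == [0xC3, 0xA9] # true: composed into U+00E9
p nfc_length(s)                                   # 1
p nfc_length("x\u0301")                           # 2: no precomposed "x with acute" exists
```

The last line shows the limitation: when Unicode has no precomposed character, the combining mark survives normalization and is still counted.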
