Consider the following Ruby code analyzing a three-byte UTF-8 string:
#encoding: utf-8
s = "\x65\xCC\x81"
p [s.bytesize, s.length, s, s.encoding.name]
#=> [3, 2, "é", "UTF-8"]
As described on this page of mine, the above really is a two-character string: Latin small letter e (U+0065) followed by Combining Acute Accent (U+0301). However, it renders as a single character, and this matters when laying out fixed-width displays.
For example, look at the two entries for “moiré.svg” on this directory listing and notice how one of them has messed up the column alignment.
How can I calculate the ‘monospace visual length’ of a string in Ruby, which does not include any zero-width combining characters? (One valid technique might be a way to transform a Unicode string into its canonical representation, turning the above into "\xC3\xA9" which also looks like é but has a length of 1.)
The unicode_utils gem may help.
Current link: https://github.com/lang/unicode_utils
Old link: http://unicode-utils.rubyforge.org/UnicodeUtils.html
There is a char_display_width method.
There is a string display_width method.
Also look at each_grapheme.
(Thanks Michael Anderson for pointing out the additional methods)
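If installing a gem is not an option, here is a minimal sketch that uses only Ruby built-ins. It assumes Ruby 2.5 or later (for String#grapheme_clusters; String#unicode_normalize has been available since 2.2). The helper names visual_length and pad_to are hypothetical, made up for this illustration. The first part shows the canonical-composition idea from the question, the second counts grapheme clusters so that combining marks do not add to the width.

```ruby
# Sketch only: assumes Ruby >= 2.5. Helper names are hypothetical.

# Count user-perceived characters (grapheme clusters), so a base letter
# plus its combining marks counts as one column.
def visual_length(str)
  str.grapheme_clusters.length
end

# Pad with spaces on the right up to `width` visual columns.
def pad_to(str, width)
  str + " " * [width - visual_length(str), 0].max
end

s   = "\x65\xCC\x81"             # "e" followed by U+0301 Combining Acute Accent
nfc = s.unicode_normalize(:nfc)  # canonical composition, where a precomposed form exists

p s.length            #=> 2
p nfc.length          #=> 1
p nfc.bytes           #=> [195, 169], i.e. "\xC3\xA9"
p visual_length(s)    #=> 1

puts pad_to("moire\u0301.svg", 12) + "|"
puts pad_to("moir\u00E9.svg", 12) + "|"
```

Both lines printed at the end should put the closing bar in the same column on a terminal that renders combining marks correctly. Note that normalization alone is not enough in general, because not every base-plus-mark sequence has a precomposed form, which is why the sketch counts grapheme clusters instead. It also treats every cluster as one column, so double-width characters (for example many East Asian ones) would still be undercounted.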