Ligatures are Unicode characters that are represented by more than one code point. For example, in Devanagari, त्र is a ligature that consists of the code points त + ् + र.
When viewed in a simple text editor like Notepad, त्र is shown as त् + र and is stored as three Unicode characters. However, when the same file is opened in Firefox, it is shown as a proper ligature.
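To make the storage concrete, here is a minimal Java sketch that lists the code points of त्र. The class and method names (CodePointDump, describe) are made up for this illustration; the string is written with Unicode escapes so the example does not depend on the source file's encoding.

```java
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only.
final class CodePointDump {

    // Formats every code point of the input as U+XXXX, separated by spaces.
    static String describe(String text) {
        return text.codePoints()
                   .mapToObj(cp -> String.format("U+%04X", cp))
                   .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // त (U+0924) + ् (U+094D, virama) + र (U+0930)
        String tra = "\u0924\u094D\u0930";
        System.out.println(describe(tra));                       // U+0924 U+094D U+0930
        System.out.println(tra.codePointCount(0, tra.length())); // 3
    }
}
```

So at the string level there are always three code points; whether they are drawn as one ligature is decided later, when the text is laid out with a font.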
So my question is: how can I detect such ligatures programmatically while reading the file from my code? Since Firefox does it, there must be a way to do it programmatically. Are there any Unicode properties that contain this information, or do I need to keep a map of all such ligatures?
The SVG CSS property text-rendering, when set to optimizeLegibility, does the same thing (it combines the code points into a proper ligature).
PS: I am using Java.
EDIT
The purpose of my code is to count the characters in the Unicode text assuming a ligature to be a single character. So I need a way to collapse multiple code points into a single ligature.
While Aaron’s answer is not exactly correct, it pushed me in the right direction. After reading through the Java API docs of java.awt.font.GlyphVector and playing a lot on the Clojure REPL, I was able to write a function which does what I want.
The idea is to find the width of the glyphs in the glyphVector and combine the glyphs with zero width with the last found non-zero width glyph. The solution is in Clojure but it should be translatable to Java if required.
Also posted on Gist.
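For readers who want Java, the zero-width idea could be sketched roughly as below. This is not the Clojure code from the Gist, only an approximation of the description above. The names LigatureCounter, countVisible, glyphAdvances and countCharacters are hypothetical, and the choice of the logical SERIF font is an assumption. The result depends on the installed font and on the text shaping that the Java runtime performs, so no expected output is claimed for the font-based part.

```java
import java.awt.Font;
import java.awt.font.FontRenderContext;
import java.awt.font.GlyphVector;

// Hypothetical sketch of the "merge zero-width glyphs" idea; not the original solution.
final class LigatureCounter {

    // Counts glyphs with a non-zero advance. A zero-advance glyph is treated as
    // merged into the preceding non-zero glyph, so it adds nothing to the count.
    // (A zero-advance glyph with no predecessor is simply not counted.)
    static int countVisible(float[] advances) {
        int count = 0;
        for (float advance : advances) {
            if (advance != 0f) {
                count++;
            }
        }
        return count;
    }

    // Lays out the text with full shaping and returns the advance of each glyph.
    static float[] glyphAdvances(String text, Font font) {
        FontRenderContext frc = new FontRenderContext(null, true, true);
        char[] chars = text.toCharArray();
        GlyphVector gv = font.layoutGlyphVector(
                frc, chars, 0, chars.length, Font.LAYOUT_LEFT_TO_RIGHT);
        float[] advances = new float[gv.getNumGlyphs()];
        for (int i = 0; i < advances.length; i++) {
            advances[i] = gv.getGlyphMetrics(i).getAdvance();
        }
        return advances;
    }

    static int countCharacters(String text, Font font) {
        return countVisible(glyphAdvances(text, font));
    }

    public static void main(String[] args) {
        // Assumption: the logical SERIF font maps to a font with Devanagari support.
        Font font = new Font(Font.SERIF, Font.PLAIN, 24);
        System.out.println(countCharacters("\u0924\u094D\u0930", font)); // त + ् + र
    }
}
```

The counting step is kept separate from the font layout so it can be checked with plain arrays: countVisible(new float[] {12f, 0f, 8f}) returns 2, for example.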