I am trying to extract the UTF-8 character values from an embedded TrueType font file contained in a PDF. Is anyone aware of a method of doing this? The values in the PDF might be something like ‘2%dd! w!|<~’, and this would end up as ‘Hello World’ in the PDF, rendered with the corresponding glyphs from the TTF. I’d like to be able to extract the wchar values here. Is this possible? Does the UTF-8 value for each character exist in the TTF?
Glyph IDs do not always correspond to Unicode character values, especially with non-Latin scripts that use a lot of ligatures and variant glyph forms, where there is no one-to-one correspondence between glyphs and characters.
Only Tagged PDF files are required to carry enough information to recover the Unicode text. Other PDFs may still include a ToUnicode CMap for a font, but when they don’t, you may have to reconstruct the characters from the glyph names in the fonts. This is possible if the fonts used have glyphs named according to Adobe’s Glyph Naming Convention or Adobe Glyph List Specification, but many fonts, including the standard Windows fonts, don’t follow this naming convention.
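To give an idea of what reconstructing characters from glyph names looks like, here is a rough Python sketch of the name-parsing rules as I understand them from the Adobe Glyph List Specification: drop any suffix after the first period, split ligature names on underscores, then try a name lookup, the `uniXXXX` form, and the `uXXXX` to `uXXXXXX` form. The function name `glyph_name_to_text` and the tiny `AGL_EXCERPT` table are made up for illustration; a real implementation would load the full glyph list, and you would still have to pull the glyph names out of the font yourself.

```python
import re

# Tiny illustrative excerpt only; the real Adobe Glyph List has thousands of entries.
AGL_EXCERPT = {
    "space": "\u0020",
    "A": "\u0041",
    "adieresis": "\u00E4",
    "Euro": "\u20AC",
}

_UNI_FORM = re.compile(r"uni((?:[0-9A-F]{4})+)")  # e.g. uni0048 or uni00660069
_U_FORM = re.compile(r"u([0-9A-F]{4,6})")         # e.g. u1F600


def _is_valid_scalar(cp):
    """True for code points outside the surrogate range and within Unicode."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def glyph_name_to_text(name, agl=AGL_EXCERPT):
    """Map a glyph name to a Unicode string, or '' if the name carries no mapping."""
    base = name.split(".", 1)[0]       # "A.swash" -> "A"
    pieces = []
    for component in base.split("_"):  # ligatures: "f_i" -> ["f", "i"]
        if component in agl:
            pieces.append(agl[component])
            continue
        m = _UNI_FORM.fullmatch(component)
        if m:
            digits = m.group(1)
            cps = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
            if all(_is_valid_scalar(cp) for cp in cps):
                pieces.append("".join(chr(cp) for cp in cps))
            continue
        m = _U_FORM.fullmatch(component)
        if m:
            cp = int(m.group(1), 16)
            if _is_valid_scalar(cp):
                pieces.append(chr(cp))
        # Anything else (e.g. "glyph123") contributes nothing.
    return "".join(pieces)
```

A name like `glyph123` or `.notdef` comes back as an empty string, which is exactly the case where the font does not follow the naming convention and this approach gives you nothing to work with.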