When parsing a PDF, given a string (popped from the Tj or TJ operator callbacks) with the Identity-H encoding, how do you map that string to a Unicode (say UTF-8) representation?
If I need a CMap for this, how do I create (or retrieve) and apply the CMap?
You’ll probably have to parse the font data itself. Identity-H just means “use the bytes as raw glyph indexes into the given font”. That’s why you MUST embed fonts when using Identity-H… different versions of the same font need not have the same glyph order.
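As a minimal sketch of what "use the bytes as raw glyph indexes" means in practice: Identity-H strings are read two bytes at a time, big-endian, and each 16-bit value is the code you then look up in the font. The function name `identity_h_codes` and the sample bytes are made up for illustration; they don't come from any particular library or PDF.

```python
import struct
from typing import List


def identity_h_codes(raw: bytes) -> List[int]:
    """Split an Identity-H string into 2-byte big-endian codes."""
    if len(raw) % 2:
        raise ValueError("Identity-H strings should have an even byte length")
    # ">H" = one big-endian unsigned 16-bit integer per code
    return [code for (code,) in struct.iter_unpack(">H", raw)]


# Hypothetical operand of a Tj operator, already unescaped to raw bytes
codes = identity_h_codes(b"\x00\x24\x00\x25\x01\x02")
print(codes)  # [36, 37, 258]
```

These numbers are not Unicode yet. They only become text once you map them through the font's own tables or a /ToUnicode CMap.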
There’s example code on how to do this sort of thing in several different open source projects. iText, for example (yes, I’m biased).
You’d mentioned a CMap. Identity-H fonts can have a CMap but aren’t required to. The /ToUnicode entry in the font dictionary is a stream containing a CMap, as defined in an Adobe spec. They aren’t all that complex:
Wow. That particular CMap is horribly inefficient. A “bfrange” starts at parameter 1 and runs up to and including parameter 2, mapping those values to consecutive destinations starting at parameter 3 (and continuing on until there are no more things to map).
For example:
could be represented as
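To illustrate the equivalence with made-up values (not taken from any real font): one bfrange line covering three codes produces the same mappings as three single-code ranges. This is only a sketch; `expand_bfrange` is a hypothetical helper name.

```python
def expand_bfrange(first: int, last: int, dest_start: int) -> dict:
    """Map each source code in first..last (inclusive) to consecutive destinations."""
    return {src: dest_start + (src - first) for src in range(first, last + 1)}


# One compact range: <0001> <0003> <0041>
one_range = expand_bfrange(0x0001, 0x0003, 0x0041)

# The wasteful form: three ranges that each cover a single code
three_ranges = {}
for code, dest in [(0x0001, 0x0041), (0x0002, 0x0042), (0x0003, 0x0043)]:
    three_ranges.update(expand_bfrange(code, code, dest))

assert one_range == three_ranges
print({hex(k): chr(v) for k, v in one_range.items()})
# {'0x1': 'A', '0x2': 'B', '0x3': 'C'}
```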
A quick Google search turned up the CMap/CID font spec.
There are also beginbfchar/endbfchar, which just take two parameters (src and dest values, no ranges), CID-based versions (at which point you need access to Adobe’s character ID tables; they’re part of Acrobat/Reader installations, though Reader will need to be prodded into downloading the various Language Packs, or kits, or whatever they’re called), and various other stuff you really ought to read that spec to find out about.
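Putting the pieces together, here is a rough sketch of reading the bfchar and bfrange sections of a /ToUnicode CMap and applying them to an Identity-H string. It assumes the stream has already been decompressed to text, that every bfrange destination is a single 4-digit UTF-16 value, and it ignores the array form of bfrange and the CID-based sections entirely. All names (`parse_tounicode`, `decode_identity_h`, `SAMPLE`) and the sample CMap are invented for the example.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"


def _utf16(hex_digits: str) -> str:
    """Destination values in a ToUnicode CMap are UTF-16BE code units."""
    return bytes.fromhex(hex_digits).decode("utf-16-be")


def parse_tounicode(cmap_text: str) -> dict:
    """Build {code: unicode string} from bfchar and simple bfrange entries."""
    mapping = {}
    for body in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, body):
            mapping[int(src, 16)] = _utf16(dst)
    for body in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        triples = re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, body)
        for lo, hi, dst in triples:
            start = int(dst, 16)  # assumption: one BMP code point
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = chr(start + offset)
    return mapping


def decode_identity_h(raw: bytes, mapping: dict) -> str:
    """Read 2-byte big-endian codes and look each one up in the mapping."""
    codes = (int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2))
    # Unmapped codes become U+FFFD so gaps stay visible
    return "".join(mapping.get(code, "\ufffd") for code in codes)


SAMPLE = """
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
1 beginbfchar
<0003> <0020>
endbfchar
1 beginbfrange
<0024> <0026> <0041>
endbfrange
"""

mapping = parse_tounicode(SAMPLE)
text = decode_identity_h(b"\x00\x24\x00\x25\x00\x03\x00\x26", mapping)
print(text.encode("utf-8"))  # b'AB C'
```

If the font has no /ToUnicode entry at all, this approach gives you nothing, and you're back to digging the mapping out of the embedded font data.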