Has anyone used a CGPDFScanner to parse the ToUnicode CMap stream entry of a font dictionary? I’m running into trouble.
I obtain the CGPDFStream reference from the dictionary and try to build a CGPDFScanner with it. The problem is that CGPDFScanner takes a CGPDFContentStream as its argument, not a CGPDFStream.
When I parse a CGPDFPage for text operators I can easily get the CGPDFContentStream with CGPDFContentStreamCreateWithPage. Its sister function, CGPDFContentStreamCreateWithStream, is described with “You can use this function to get access to the contents of a form, pattern, Type3 font, or any PDF stream”, but the CGPDFContentStream Reference is vague about how to call it and I’m unable to find sample code.
Anyway, I pass the CMap stream as the stream argument, the CGPDFDictionary obtained from the stream with CGPDFStreamGetDictionary as the streamResources parameter, and the page content stream as the parent. The resource dictionary can easily be obtained from the stream itself, so why ask for it in the first place? On top of that, passing NULL for every parameter except the first seems to make no difference whatsoever.
The result is always the same: when I try to scan the content stream with a scanner set up with a few callbacks, I get the following messages:
`begincodespacerange' isn't an operator. `beginbfrange' isn't an operator. ... `endbfrange' isn't an operator.
for every operator set up in the callback table. This happens for every CMap encountered.
So I’m unsure whether the content stream is set up the wrong way, whether the operators are invalid, or whether CGPDFScanner simply cannot be used to parse a CMap, even though it is a regular PDF stream object, in which case I would have to write my own scanner to parse the stream data.
CGPDFScanner can parse only PDF content streams, that is, streams that contain content for display. Page content, form XObjects, patterns and Type3 font glyph descriptions all share the same stream format. The ToUnicode CMap is a totally different kind of stream and uses a different syntax from content streams, which is why the scanner rejects operators such as beginbfrange. You need to write your own scanner to parse the CMap. The ToUnicode CMap format is documented in the Adobe PDF specification (section 5.9.2 of PDF Reference 1.7) and in Adobe Technical Notes #5014 and #5411.
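To give an idea of what such a hand-written scanner might look like, here is a minimal sketch in plain C. It assumes you have already copied the decoded stream bytes into a buffer (CGPDFStreamCopyData should give you those as a CFData) and it only looks at the beginbfchar and beginbfrange sections. The names CMapEntry, CMapTable, cmap_parse and cmap_lookup are made up for this example and are not part of Core Graphics. It also assumes source codes and destinations fit in 32 bits, so longer destination strings (some ligatures, for example) would need extra handling, and it ignores the codespace ranges.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical types: one character code -> Unicode value mapping. */
typedef struct { uint32_t code, unicode; } CMapEntry;
typedef struct { CMapEntry *items; size_t count, cap; } CMapTable;

static int cmap_add(CMapTable *t, uint32_t code, uint32_t uni) {
    if (t->count == t->cap) {                       /* grow the array */
        size_t ncap = t->cap ? t->cap * 2 : 64;
        CMapEntry *p = realloc(t->items, ncap * sizeof *p);
        if (!p) return 0;
        t->items = p;
        t->cap = ncap;
    }
    t->items[t->count].code = code;
    t->items[t->count].unicode = uni;
    t->count++;
    return 1;
}

/* Bounded prefix test: the stream data is not NUL-terminated. */
static int starts_with(const char *p, const char *end, const char *kw) {
    size_t n = strlen(kw);
    return (size_t)(end - p) >= n && memcmp(p, kw, n) == 0;
}

/* Read one <hex> token; on failure *pp is left unchanged. */
static int read_hex(const char **pp, const char *end, uint32_t *out) {
    const char *p = *pp;
    uint32_t v = 0;
    while (p < end && isspace((unsigned char)*p)) p++;
    if (p >= end || *p != '<') return 0;
    for (p++; p < end && *p != '>'; p++) {
        unsigned char c = (unsigned char)*p;
        if (isdigit(c)) v = (v << 4) | (uint32_t)(c - '0');
        else if (isxdigit(c)) v = (v << 4) | (uint32_t)(tolower(c) - 'a' + 10);
    }
    if (p >= end) return 0;                         /* unterminated token */
    *pp = p + 1;
    *out = v;
    return 1;
}

/* Collect all bfchar/bfrange mappings; returns 0 only if allocation fails. */
int cmap_parse(const char *data, size_t len, CMapTable *t) {
    const char *p = data, *end = data + len;
    uint32_t lo, hi, dst;
    while (p < end) {
        if (starts_with(p, end, "beginbfchar")) {
            p += strlen("beginbfchar");
            while (read_hex(&p, end, &lo) && read_hex(&p, end, &dst))
                if (!cmap_add(t, lo, dst)) return 0;
        } else if (starts_with(p, end, "beginbfrange")) {
            p += strlen("beginbfrange");
            while (read_hex(&p, end, &lo) && read_hex(&p, end, &hi)) {
                while (p < end && isspace((unsigned char)*p)) p++;
                if (p < end && *p == '[') {         /* one destination per code */
                    p++;
                    for (uint32_t c = lo; read_hex(&p, end, &dst); c++)
                        if (!cmap_add(t, c, dst)) return 0;
                    while (p < end && *p != ']') p++;
                    if (p < end) p++;
                } else if (read_hex(&p, end, &dst)) { /* consecutive destinations */
                    if (hi < lo || hi - lo > 0xFFFF) continue; /* skip bad range */
                    for (uint32_t i = 0; i <= hi - lo; i++)
                        if (!cmap_add(t, lo + i, dst + i)) return 0;
                }
            }
        } else {
            p++;                                    /* skip everything else */
        }
    }
    return 1;
}

/* Linear lookup; returns 1 and stores the Unicode value if code is mapped. */
int cmap_lookup(const CMapTable *t, uint32_t code, uint32_t *out) {
    for (size_t i = 0; i < t->count; i++)
        if (t->items[i].code == code) { *out = t->items[i].unicode; return 1; }
    return 0;
}
```

You would start with a zeroed table (CMapTable t = {0};), call cmap_parse on the stream bytes, query it with cmap_lookup for each character code you pull out of the text operators, and free(t.items) when you are done. A sorted array or hash table would be a better fit than the linear search if the fonts have many mappings.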