I’m trying to extract text from a PDF.
The PDF Reference is hard going and leaves a lot of practical questions unanswered.
My question is: if the font dictionary contains both a /ToUnicode CMap and an /Encoding entry, does the CMap always cover every character used with that font, so that I don’t need /Encoding or anything else to recover the text drawn with it?
Chapter 5.9 of the PDF Reference seems to say yes, but some of my tests suggest the answer is no.
Chapter 5.9 is correct: for a conforming file, the ToUnicode CMap should be enough for text extraction. The problem is that many PDF files do not follow the PDF specification properly, so in practice you have to implement your own heuristics for text extraction.
You start with the PDF specification and then you update your text extraction method with various enhancements based on the non-conforming PDF files you encounter.
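As a rough illustration of one such heuristic, here is a minimal Python sketch that looks each character code up in the ToUnicode CMap first and falls back to a table derived from /Encoding when the CMap has no entry. Everything here is an assumption made for the example: the names `parse_tounicode` and `decode` are hypothetical, the parser only understands `bfchar` and `bfrange` sections of an already decompressed CMap stream, it assumes single-byte codes (a simple font), and `fallback_encoding` stands in for whatever code-to-character table you build from /Encoding yourself.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"
# <lo> <hi> followed by either a single <dst> or an array [<dst1> <dst2> ...]
RANGE = re.compile(HEX + r"\s*" + HEX + r"\s*(?:" + HEX + r"|\[(.*?)\])", re.S)


def hex_to_text(h):
    """Decode a CMap destination hex string, which is UTF-16BE."""
    if len(h) % 2:
        h += "0"  # PDF pads an odd-length hex string with a trailing 0
    return bytes.fromhex(h).decode("utf-16-be")


def parse_tounicode(cmap_text):
    """Build {character code: unicode string} from bfchar/bfrange sections."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            mapping[int(src, 16)] = hex_to_text(dst)
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst, array in RANGE.findall(block):
            lo, hi = int(lo, 16), int(hi, 16)
            if dst:
                # Simplification: bump the last character once per code.
                first = hex_to_text(dst)
                for i in range(hi - lo + 1):
                    mapping[lo + i] = first[:-1] + chr(ord(first[-1]) + i)
            else:
                for i, item in enumerate(re.findall(HEX, array)):
                    mapping[lo + i] = hex_to_text(item)
    return mapping


def decode(codes, to_unicode, fallback_encoding):
    """Prefer ToUnicode; fall back to /Encoding; mark anything else as unknown."""
    out = []
    for code in codes:  # iterating over bytes yields ints
        if code in to_unicode:
            out.append(to_unicode[code])
        elif code in fallback_encoding:
            out.append(fallback_encoding[code])
        else:
            out.append("\ufffd")  # U+FFFD REPLACEMENT CHARACTER
    return "".join(out)


SAMPLE_CMAP = """
2 beginbfchar
<01> <0048>
<02> <0069>
endbfchar
1 beginbfrange
<10> <12> <0041>
endbfrange
"""

to_unicode = parse_tounicode(SAMPLE_CMAP)
# Code 0x20 is missing from the CMap, so it comes from the fallback table.
text = decode(b"\x01\x02\x20\x11", to_unicode, {0x20: " "})
```

Emitting U+FFFD for codes that neither table covers keeps the gaps visible, which makes it easier to spot the non-conforming files that need another heuristic.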