I am trying to extract text from a PDF. The PDF contains text in

Question

Asked: May 25, 2026 (2026-05-25T16:33:22+00:00)

I am trying to extract text from a PDF that contains text in Hindi (Unicode). The extraction utility I am using is Apache PDFBox (http://pdfbox.apache.org/). The extractor does return text, but the text is unrecognizable. I tried switching between many encodings and fonts, but I still do not get the expected text.
Here is an example:
Text in the PDF: पवार
What it looks like after extraction: ̄Ö3⁄4ÖÖ ̧ü

Are there any suggestions?

1 Answer

Editorial Team · Answer 1 · 2026-05-25T16:33:23+00:00

PDF is, at its heart, a print format, so it records text as a series of visual glyphs rather than as actual text. It was never intended as a digital archive format, and that still shows in many documents. With complex scripts, such as Arabic or the Indic scripts, which require glyph substitution, ligation and reordering, you often get a mess. What you usually get out are the glyph IDs used in the embedded fonts, and those bear no resemblance to Unicode or to any actual text encoding. Fonts represent glyphs: some of them may be mapped to Unicode code points, but others exist only for font-internal use, such as contextual glyph variants or ligatures. You can see the same problem with PDFs produced by LaTeX, especially with non-ASCII characters and math.

PDF also has facilities to embed the text as text alongside the visual representation (for example, a font's ToUnicode mapping or ActualText entries), but whether they are used is solely at the discretion of the generating application. I have heard that Word tries very hard to retain that information when producing PDFs, but many PDF generators do not. It usually works reasonably well for Latin text, which is probably why almost no one bothers.

If the PDF doesn’t carry the plain text, I think your best bet is to render the pages as images and run OCR on them.
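
To decide automatically whether a given PDF needs the OCR route, one option is to check whether the extracted string contains characters of the expected script at all. The following is only a minimal sketch using the JDK's `Character.UnicodeScript`; the class name `ScriptCheck`, the method names and the 0.5 threshold are illustrative assumptions, not part of PDFBox, and the input is assumed to be whatever string your extractor returned.

```java
public final class ScriptCheck {

    /** Fraction of non-whitespace code points that belong to the Devanagari script. */
    static double devanagariRatio(String text) {
        int total = 0;
        int devanagari = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // spaces and line breaks say nothing about the script
            }
            total++;
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.DEVANAGARI) {
                devanagari++;
            }
        }
        return total == 0 ? 0.0 : (double) devanagari / total;
    }

    /** Hypothetical helper: treat the text as usable Hindi if enough of it is Devanagari. */
    static boolean looksLikeHindi(String text, double threshold) {
        return devanagariRatio(text) >= threshold;
    }

    public static void main(String[] args) {
        // The two strings from the question, written as Unicode escapes so the
        // source file does not depend on the compiler's default encoding.
        String expected = "\u092A\u0935\u093E\u0930";                          // the Hindi word in the PDF
        String extracted = "\u0304\u00D63\u20444\u00D6\u00D6 \u0327\u00FC";    // what the extractor returned

        System.out.println("expected looks like Hindi:  " + looksLikeHindi(expected, 0.5));
        System.out.println("extracted looks like Hindi: " + looksLikeHindi(extracted, 0.5));
    }
}
```

If the check fails for text you know should be Hindi, the PDF most likely exposes only glyph IDs, and OCR is the remaining option. The threshold is a judgment call: documents that mix Hindi with Latin text or many digits may need a lower value.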
