What’s wrong with this code… I am trying to parse pdf files and extract

Asked: May 25, 2026, 16:08:35 UTC

What’s wrong with this code? I am trying to parse PDF files and extract their text. For some PDFs the text is extracted correctly, but for others the following error is thrown:

Invalid dictionary, found: '' but expected: '/'
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@67fb878

Also, for some PDFs I don’t get any metadata values in the md variable, while for others I do.

This is my code. Could there be a problem with the byte array?

    private BinaryParser binaryParser;
    binaryParser.parse(page.getBinaryData());


    public void parse(byte[] data) {
        InputStream is = null;
        try {
            is = new ByteArrayInputStream(data);
            text = null;
            Metadata md = new Metadata();
            metaData = new HashMap<String, String>();
            text = tika.parseToString(is, md).trim();
            processMetaData(md);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            IOUtils.closeQuietly(is);
        }
    }

    private void processMetaData(Metadata md) {
        if ((getMetaData() == null) || (!getMetaData().isEmpty())) {
            setMetaData(new HashMap<String, String>());
        }
        for (String name : md.names()) {
            getMetaData().put(name.toLowerCase(), md.get(name));
        }
    }

1 Answer

Editorial Team · Answer 1 · 2026-05-25T16:08:35+00:00

Tika is not perfect. It has trouble with many PDF files (unless a lot has changed in the last year), so first make sure you are using an up-to-date version. When I last used Tika, about nine months ago, it was at version 0.8, and a bug in that version made PDF parsing particularly problematic. I sidestepped the issue by using PDFBox, which Apache Tika wraps, directly. Some of my code wrapping PDFBox is at the end of this post in case you decide to try this route.

If nothing else, using PDFBox directly will give you more control over the parameters. One such parameter controls how “beaded” text is handled. For example, a newspaper with columns is beaded, whereas a letter is not. PDFBox can attempt to maintain the flow of writing across beads, but it doesn’t always do a great job. If you are only extracting text from non-beaded PDFs, you might want to disable this feature.

You may also want to try the command-line program pdftotext. Once again, make sure you have the latest version. With all PDF-to-text converters, the quality of the results can change a lot from one version to the next!
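If you want to try pdftotext from Java, one option is to run it as an external process. The following is only a rough sketch, assuming a pdftotext binary (from Xpdf or Poppler) is on the PATH. The class name `PdfToTextRunner` and its methods are made up for illustration; the `-enc`, `-layout` and `-` (write to standard output) arguments are pdftotext's own options.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: shells out to the pdftotext command-line tool.
final class PdfToTextRunner {

    // Builds the argument list: pdftotext -enc UTF-8 [-layout] <file> -
    static List<String> buildCommand(Path pdf, boolean keepLayout) {
        List<String> cmd = new ArrayList<>();
        cmd.add("pdftotext");
        cmd.add("-enc");
        cmd.add("UTF-8");
        if (keepLayout) {
            cmd.add("-layout"); // try to preserve the physical page layout
        }
        cmd.add(pdf.toString());
        cmd.add("-"); // "-" as the output file sends the text to stdout
        return cmd;
    }

    // Runs pdftotext and returns whatever it printed to stdout.
    static String extract(Path pdf, boolean keepLayout)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf, keepLayout))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // keep stderr from filling up
                .start();
        String text;
        try (InputStream out = process.getInputStream()) {
            text = new String(out.readAllBytes(), StandardCharsets.UTF_8);
        }
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("pdftotext exited with status " + exit);
        }
        return text;
    }
}
```

Calling `PdfToTextRunner.extract(Paths.get("some.pdf"), false)` would then return the extracted text, or throw if pdftotext reports a failure, which makes it easy to compare its output against Tika's on the files that fail.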

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.Writer;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    // Uses the PDFBox 1.x package layout (org.apache.pdfbox.util).
    // ConversionException is an application-specific exception, not part of PDFBox.
    // The extracted text is written to the supplied writer, so nothing is returned.
    public static void pdfbox(InputStream is, Writer writer) throws IOException, ConversionException {
        boolean force = true;

        PDDocument document = null;
        try {
            document = PDDocument.load(is, force); // force extraction

            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setForceParsing(force); // continue when errors are encountered.
            stripper.setSortByPosition(false); // text may not be in visual order.
            stripper.setShouldSeparateByBeads(true); // beads are columns, attempt to handle them.

            stripper.writeText(document, writer);
        } finally {
            try {
                if (document != null) {
                    document.close();
                }
            } catch (Exception e) {
                throw new ConversionException(e);
            }
        }
    }
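On the question of whether the byte array itself is at fault: before blaming the parser, it may be worth ruling out a truncated or non-PDF payload (for example an HTML error page saved in place of the PDF). Below is a minimal sketch of such a check, using only the JDK. The class `PdfSanityCheck` and the 1024-byte search windows are assumptions for illustration; passing the check does not prove the file is valid, it only catches the most obvious cases.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: a cheap plausibility check, not a PDF validator.
final class PdfSanityCheck {

    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);
    private static final int WINDOW = 1024; // assumed search window at each end

    // Returns the first index in [from, to) where marker starts, or -1.
    static int indexOf(byte[] data, byte[] marker, int from, int to) {
        for (int i = Math.max(from, 0); i < to && i + marker.length <= data.length; i++) {
            int j = 0;
            while (j < marker.length && data[i + j] == marker[j]) {
                j++;
            }
            if (j == marker.length) {
                return i;
            }
        }
        return -1;
    }

    // True if "%PDF-" appears near the start and "%%EOF" near the end.
    static boolean looksLikePdf(byte[] data) {
        if (data == null) {
            return false;
        }
        boolean hasHeader = indexOf(data, HEADER, 0, Math.min(data.length, WINDOW)) >= 0;
        boolean hasTrailer =
                indexOf(data, EOF_MARKER, Math.max(0, data.length - WINDOW), data.length) >= 0;
        return hasHeader && hasTrailer;
    }
}
```

If `PdfSanityCheck.looksLikePdf(page.getBinaryData())` returns false for the files that fail, the problem is more likely in how the bytes were fetched or stored than in Tika or PDFBox.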
