The Archive Base

Editorial Team
Asked: May 25, 2026 (2026-05-25T16:08:35+00:00)

What’s wrong with this code… I am trying to parse pdf files and extract


What’s wrong with this code? I am trying to parse PDF files and extract their text. For some PDFs I am able to extract the text, but for others it throws this error:

Invalid dictionary, found: '' but expected: '/'
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@67fb878

I also don’t get any metadata values in the md variable for some PDFs, although for others I do.

This is my code. Is there some problem with the byte array?

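    // Caller: the raw bytes of the fetched page are handed to the parser.
    private BinaryParser binaryParser;
    binaryParser.parse(page.getBinaryData());


    // Inside BinaryParser; the fields text, metaData and tika are not shown in the question.
    public void parse(byte[] data) {
        InputStream is = null;
        try {
            is = new ByteArrayInputStream(data);
            text = null;
            Metadata md = new Metadata();
            metaData = new HashMap<String, String>();
            // If parseToString throws, processMetaData is never reached
            // and the metadata map stays empty for that file.
            text = tika.parseToString(is, md).trim();
            processMetaData(md);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            IOUtils.closeQuietly(is);
        }
    }

    // Copies every Tika metadata entry into the map under a lower-cased key.
    private void processMetaData(Metadata md) {
        if ((getMetaData() == null) || (!getMetaData().isEmpty())) {
            setMetaData(new HashMap<String, String>());
        }
        for (String name : md.names()) {
            getMetaData().put(name.toLowerCase(), md.get(name));
        }
    }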
1 Answer

  1. Editorial Team
    Added an answer on May 25, 2026 at 4:08 pm (2026-05-25T16:08:35+00:00)

    Tika is not perfect. It fails on many PDF files (unless a lot has changed in the last year), so first make sure you are using an up-to-date version. When I used Tika, about nine months ago, it was at version 0.8, and that version had a bug that made PDF parsing particularly problematic. I sidestepped the issue by using PDFBox directly, which is the library Apache Tika wraps for PDF parsing. Some of my code wrapping PDFBox is at the end of this post in case you decide to try this route.
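
Since the question asks whether the byte array itself is at fault, it may be worth ruling that out before blaming the parser. The following is only a rough sketch, not part of Tika or PDFBox: the class and method names are made up for illustration. It assumes a well-formed PDF starts with `%PDF-` and carries a `%%EOF` marker within roughly its last 1024 bytes. Some files that real viewers accept break these assumptions, so treat a failed check as a hint (for example an HTML error page or a truncated download), not as proof.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: cheap sanity checks on bytes that are supposed to be a PDF.
final class PdfBytesCheck {
    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    // Returns the first index in [from, to) where pattern occurs entirely, or -1.
    private static int indexOf(byte[] data, byte[] pattern, int from, int to) {
        int end = Math.min(to, data.length);
        for (int i = Math.max(from, 0); i + pattern.length <= end; i++) {
            boolean match = true;
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    // True when the data begins with the "%PDF-" signature at offset 0.
    static boolean startsWithPdfHeader(byte[] data) {
        return indexOf(data, HEADER, 0, HEADER.length) == 0;
    }

    // True when "%%EOF" appears somewhere in the last 1024 bytes.
    static boolean hasEofMarkerNearEnd(byte[] data) {
        return indexOf(data, EOF_MARKER, data.length - 1024, data.length) >= 0;
    }
}
```

Calling both checks on `page.getBinaryData()` before handing the bytes to the parser, and logging which files fail, should show whether the failing documents are damaged input or genuine parser problems.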

    If nothing else, using PDFBox directly gives you more control over the parsing parameters. One such parameter controls how “beaded” text is handled. For example, a newspaper laid out in columns is beaded, whereas a letter is not. PDFBox can attempt to maintain the flow of writing across beads, but it doesn’t always do a great job. If you are not extracting text from beaded PDFs, you might want to disable this feature.

    You may also want to try the pdftotext program. Once again, make sure you have the latest version: with all PDF-to-text converters, how well they perform changes rapidly from version to version!
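
If pdftotext is installed and on the PATH, one possible way to call it from Java is through `ProcessBuilder`. This is a minimal sketch under those assumptions; the class and method names are hypothetical. It uses the `-enc UTF-8` option and a trailing `-` so that pdftotext writes the text to standard output.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

// Hypothetical wrapper around the external pdftotext command-line tool.
final class PdfToTextSketch {

    // Builds the command line: pdftotext -enc UTF-8 <pdf> -
    // The trailing "-" asks pdftotext to write to standard output.
    static List<String> buildCommand(Path pdf) {
        return List.of("pdftotext", "-enc", "UTF-8", pdf.toString(), "-");
    }

    // Runs pdftotext and returns the extracted text, or throws if the tool fails.
    static String extract(Path pdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf))
                // Discard stderr so a chatty tool cannot fill the pipe and block.
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        // Drain stdout completely before waiting for the process to exit.
        byte[] out = process.getInputStream().readAllBytes();
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("pdftotext exited with status " + exit + " for " + pdf);
        }
        return new String(out, StandardCharsets.UTF_8);
    }
}
```

A call such as `PdfToTextSketch.extract(Path.of("some.pdf"))` needs the file on disk, so bytes held in memory would first have to be written to a temporary file.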

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.Writer;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    // ConversionException is my own exception type; its definition is not shown here.
    // The extracted text is written to the supplied writer, so nothing is returned.
    public static void pdfbox(InputStream is, Writer writer) throws IOException, ConversionException {
        boolean force = true;

        PDDocument document = null;
        try {
            document = PDDocument.load(is, force); // force extraction

            // Created here because the constructor can throw IOException.
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setForceParsing(force); // continue when errors are encountered.
            stripper.setSortByPosition(false); // text may not be in visual order.
            stripper.setShouldSeparateByBeads(true); // beads are columns, attempt to handle them.

            stripper.writeText(document, writer);
        }
        finally {
            try {
                if (document != null) {
                    document.close();
                }
            }
            catch (Exception e) {
                throw new ConversionException(e);
            }
        }
    }
    
© 2021 The Archive Base. All Rights Reserved