Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6342643
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 24, 20262026-05-24T20:16:50+00:00 2026-05-24T20:16:50+00:00

I am using tika to extract text from a pdf file that has lot

  • 0

I am using tika to extract text from a pdf file that has lot of tables.

java -jar tika-app-0.9.jar -t https://s3.amazonaws.com/centraldoc/alg1.pdf

It is returning some invalid text and sometimes it is trimming white space between 2 words; for example it returns
“qu inakli fmyathematical ideas to the real world” instead of “Link mathematical ideas to the real world”.

Is there a way to minimize this kind of error? or is there another library that I can use? Does it make sense to use OCR to process these kind of pdf.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-24T20:16:51+00:00Added an answer on May 24, 2026 at 8:16 pm

    Try to control order when using PDFBox parser: PDFTextStripper has a flag that controls the order of lines in the document. By default (in PDFBox) it’s set to false for performance reasons (no order preserved), but Tika changed its behavior between releases switching this flag on and off.

    More details exactly on this problem in my blog Extracting text from PDF files with Apache Tika 0.9 (and PDFBox under the hood).

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I'm using ABCpdf to extract the text content of some PDF files, in particular
I am trying to parse a plain text file using Tika but getting inconsistent
We need to get tree like structure from a given text document using Java.
Using TortoiseSVN against VisualSVN I delete a source file that I should not have
Using C#, I need a class called User that has a username, password, active
I've been using Tika for a while and I know that one is supposed
Using ASP.NET MVC there are situations (such as form submission) that may require a
Using the Cocoa Framework, how can I add bullets and numbering to text?
Using the HTML5 <canvas> element, I would like to load an image file (PNG,
Using android 2.3.3, I have a background Service which has a socket connection. There's

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.