The Archive Base

Editorial Team
Asked: May 26, 2026 (2026-05-26T13:18:04+00:00)

When I try to extract text from my PDF files, it seems to insert

When I try to extract text from my PDF files, the extractor seems to insert white space inside several words at random.

I am using pdfbox-app-1.6.0.jar (the latest version) on the sample file in the Downloads section of this page:
http://www.sheffield.gov.uk/roads/children/parents/6-11/pedestrian-training

I’ve tried several other PDF files, and the same thing happens on several pages.

I run the following on the downloaded file:

java -jar pdfbox-app-1.6.0.jar ExtractText -force -console ~/Desktop/"ped training pdf.pdf"

The console output then contains wrongly inserted spaces, for example:
“• If ch ildren are able to walk to
schoo l safely this could reduce the
congestion.”

“• Develops good hab its for later life.”

“www.sheff ield.gov.uk”

“Think Ahead!, wh ich is based on the”

etc etc.

As you can see, several of the words above are split by spaces for no reason I can fathom.

I am on Ubuntu, running Sun’s JDK 1.6.

I’ve searched the forums for a solution. There were similar bugs reported, but all of them seem to have been resolved.

Any help is appreciated, and if anyone else has the same problem, please comment. This is causing a big problem for indexing the content properly for search.


1 Answer

  1. Editorial Team
    Added an answer on May 26, 2026 at 1:18 pm

    Unfortunately, there is currently no easy solution for this.

    Internally, PDF documents simply contain instructions like “place characters ‘abc’ in position X” and “place characters ‘def’ in position Y”, and PDFBox tries to infer whether the resulting extracted text should be “abc def” or “abcdef” based on things like the distance between X and Y. These heuristics are generally pretty accurate, but as you can see they don’t always produce the correct result.
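
The distance-based decision described above can be illustrated with a minimal sketch. This is not PDFBox's actual code: the `TextChunk` class, the `join` method, and the threshold values are hypothetical, chosen only to show how one threshold can glue a word together while a slightly smaller one splits it.

```java
import java.util.ArrayList;
import java.util.List;

final class GapHeuristicSketch {

    /** Hypothetical stand-in for "place these characters at position x". */
    static final class TextChunk {
        final String text;
        final double x;      // horizontal start position
        final double width;  // rendered width of the characters

        TextChunk(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /** Joins chunks left to right, inserting a space when the gap exceeds the threshold. */
    static String join(List<TextChunk> chunks, double spaceThreshold) {
        StringBuilder out = new StringBuilder();
        TextChunk prev = null;
        for (TextChunk chunk : chunks) {
            if (prev != null) {
                // Distance between the end of the previous chunk and the start of this one.
                double gap = chunk.x - (prev.x + prev.width);
                if (gap > spaceThreshold) {
                    out.append(' ');
                }
            }
            out.append(chunk.text);
            prev = chunk;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<TextChunk> line = new ArrayList<>();
        line.add(new TextChunk("ch", 0.0, 10.0));
        line.add(new TextChunk("ildren", 11.5, 30.0)); // gap of 1.5 after "ch"
        line.add(new TextChunk("are", 46.0, 15.0));    // gap of 4.5 after "ildren"

        System.out.println(join(line, 2.0)); // prints "children are"
        System.out.println(join(line, 1.0)); // prints "ch ildren are"
    }
}
```

With a slightly wider than usual gap between two glyph runs (kerning or justified text can cause this), any fixed threshold will occasionally land on the wrong side, which matches the kind of output shown in the question.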

    One way to improve the quality of the extracted text is to try a dictionary lookup on each extracted word or token. If the lookup fails, try combining the token with the next one. If a dictionary lookup on the combined token succeeds, then it’s fairly likely that the text extractor has mistakenly added an extra space inside the word. Unfortunately such a feature does not yet exist in PDFBox. See https://issues.apache.org/jira/browse/PDFBOX-1153 for the feature request filed for this. Patches welcome!
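
As a rough sketch of the dictionary idea described above, the following post-processing step could be run on the tokens that come out of the extractor. It is not part of PDFBox; the class name `DictionaryMergeSketch`, the `repair` and `known` methods, and the tiny word list are all assumptions made for illustration. A real dictionary would be loaded from a word list, and tokens would need punctuation stripped before lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class DictionaryMergeSketch {

    /** Case-insensitive lookup; assumes the dictionary holds lower-case words. */
    static boolean known(String token, Set<String> dictionary) {
        return dictionary.contains(token.toLowerCase(Locale.ROOT));
    }

    /**
     * If a token is not in the dictionary, try combining it with the next token.
     * When the combined form is a known word, assume the space was spurious.
     */
    static List<String> repair(List<String> tokens, Set<String> dictionary) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            String current = tokens.get(i);
            if (!known(current, dictionary) && i + 1 < tokens.size()) {
                String combined = current + tokens.get(i + 1);
                if (known(combined, dictionary)) {
                    out.add(combined);
                    i += 2; // both fragments consumed
                    continue;
                }
            }
            out.add(current); // known word, or no successful merge: keep as-is
            i++;
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative word list only.
        Set<String> dictionary = new HashSet<>(Arrays.asList(
                "if", "children", "are", "able", "good", "habits", "its"));

        List<String> extracted = Arrays.asList("If", "ch", "ildren", "are", "able");
        System.out.println(String.join(" ", repair(extracted, dictionary)));
        // prints "If children are able"
    }
}
```

This only catches splits where the first fragment is not itself a word. A split such as "for" + "get" would pass unnoticed, because the lookup on "for" succeeds and no merge is attempted.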


