Editorial Team
Asked: June 11, 2026 (2026-06-11T00:24:36+00:00)

I need to parse a PDF document. I have already implemented the parser using the iText library, and until now it has worked without any problems.

But now I need to parse another document, which yields very strange whitespace in the middle of words. For example, I get:

Vo rber eitung auf die Motorr adsaison. Viele Motorr adf ahr er

All of the split words should be connected, but somehow the PDF parser is inserting whitespace into the words. When I copy and paste the content from the PDF into a text file, however, I don't get these spaces.

At first I thought it was caused by the PDF parsing library I’m using, but I get exactly the same issue with another library.

I looked at the singleSpaceWidth of the parsed words and noticed that it varies exactly where a whitespace is inserted. I tried to join the fragments manually, but since there is no real pattern for recombining the words, that is almost impossible.

Has anyone else had a similar issue, or even a solution to this problem?

As requested, here is some more information:

  • iText Version 5.2.1
  • http://prine.ch/whitespacesProblem.pdf (Link to the pdf)

Parsing with SemTextExtractionStrategy:

PdfReader reader = new PdfReader("data/SpecialTests/SuedostSchweiz/" + src);

SemTextExtractionStrategy semTextExtractionStrategy = new SemTextExtractionStrategy();

for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    // Set the page number on the strategy. Is used in the Parsing strategies.
    semTextExtractionStrategy.pageNumber = i;

    // Parse text from page
    PdfTextExtractor.getTextFromPage(reader, i, semTextExtractionStrategy);
}

Here is the SemTextExtractionStrategy method that actually parses the text. In it I manually add a whitespace after every parsed word, but somehow the words are already split during detection:

@Override
public void parseText(TextRenderInfo renderInfo, int pageNumber) {      

    this.pageNumber = pageNumber;

    String text = renderInfo.getText();

    currTextBlock.getText().append(text + " ");

    ....
}

Here is the whole SemTextExtractionStrategy class, although it only calls the method shown above (parseText):

public class SemTextExtractionStrategy implements TextExtractionStrategy {

    // Text Extraction Strategies
    public ColumnDetecter columnDetecter = new ColumnDetecter();

    // Image Extraction Strategies
    public ImageRetriever imageRetriever = new ImageRetriever();

    public int pageNumber = -1;

    public ArrayList<TextParsingStrategy> textParsingStrategies = new ArrayList<TextParsingStrategy>();
    public ArrayList<ImageParsingStrategy> imageParsingStrategies = new ArrayList<ImageParsingStrategy>();

    public SemTextExtractionStrategy() {

        // Add all text parsing strategies which are later on applied on the extracted text
        // textParsingStrategies.add(fontSizeMatcher);
        textParsingStrategies.add(columnDetecter);

        // Add all image parsing strategies which are later on applied on the extracted text
        imageParsingStrategies.add(imageRetriever);
    }

    @Override
    public void beginTextBlock() {

    }

    @Override
    public void renderText(TextRenderInfo renderInfo) {
        // TEXT PARSING
        for(TextParsingStrategy strategy : textParsingStrategies) {
            strategy.parseText(renderInfo, pageNumber);
        }
    }

    @Override
    public void endTextBlock() {

    }

    @Override
    public void renderImage(ImageRenderInfo renderInfo) {
        for(ImageParsingStrategy strategy : imageParsingStrategies) {
            strategy.parseImage(renderInfo);
        }
    }
}
1 Answer

  1. Editorial Team
    Added an answer on June 11, 2026 at 12:24 am

    I have processed the given PDF file with the following Ghostscript command:

    gs -o out.pdf -q -sDEVICE=pdfwrite -dOptimize=false -dUseFlateCompression=false -dCompressPages=false -dCompressFonts=false whitespacesProblem.pdf
    

    This command creates a file out.pdf whose streams are not compressed, so it is easier to read. The interesting part is on line 52, which I split into multiple lines for readability:

    [
      (&;&)-287.988
      (672744)29.9906
      (+\(%)30.01
      (+!4)29.9876
      (&4)-287.989
      (%4)30.0039
      (&1&8)-287.975
      (3=\)!)-288.021
      (*&4)30.0212
      (&=23)-287.996
      (+1%)-287.99
      (\(=&)-288.011
      (8&1&)-287.974
      (672744)29.9906
      (+\(3+=378$)-250.977
      (#7\)!)
    ]TJ
    

    Between the parentheses are the text characters. I changed some of them and watched the rendered PDF file to see which character represents which glyph. Then I decoded the text:

    [
      (ele)-287.988
      (Motorr)29.9906 ***
      (adf)30.01 ***
      (ahr)29.9876 ***
      (er)-287.989
      (fr)30.0039
      (euen)-287.975
      (sich)-288.021
      ...
    ]
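The decoding step described above amounts to a substitution table built by hand. A minimal Java sketch of it follows, assuming only the code-to-glyph pairs that can be read off the decoded excerpt; the class name `GlyphTableDecoder` and the method `decode` are hypothetical, and any code not in the table is shown as `?` rather than guessed.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: decodes raw TJ string operands with a hand-built glyph table. */
class GlyphTableDecoder {

    // Raw code -> visible character, read off the decoded excerpt above.
    // Assumption: the table covers only the codes that appear in that excerpt.
    private static final Map<Character, Character> TABLE = new HashMap<>();
    static {
        String raw     = "&;6724+(%!183=)";
        String decoded = "elMotradfhunsic";
        for (int i = 0; i < raw.length(); i++) {
            TABLE.put(raw.charAt(i), decoded.charAt(i));
        }
    }

    /** Decodes the content of one parenthesised string operand. */
    static String decode(String operand) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < operand.length(); i++) {
            char c = operand.charAt(i);
            // In a PDF literal string, "\(" and "\)" stand for the bare parenthesis.
            if (c == '\\' && i + 1 < operand.length()) {
                c = operand.charAt(++i);
            }
            // Codes that were never identified are marked instead of guessed.
            out.append(TABLE.getOrDefault(c, '?'));
        }
        return out.toString();
    }
}
```

Calling `GlyphTableDecoder.decode("672744")` yields the fragment `Motorr`, matching the second entry of the array. A real extractor would take this mapping from the font's encoding or ToUnicode data instead of a manual table.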
    

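    So the words really are split into separate strings with numeric adjustments between them. In a TJ array each number is given in thousandths of a text-space unit and is subtracted from the current horizontal position, so the values near -288 open a gap the size of a normal word space, while the values near +30 (marked *** above) pull the next string slightly closer, which is most likely the kerning of the font. The question is now how your PDF library interprets these adjustments, and it seems to me that even this “negative whitespace” is rendered as a space in the resulting string.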
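One way to act on this is to decide per adjustment whether it is a word gap, instead of adding a space after every string. The Java sketch below works on plain data and does not use iText; `TjFragmentJoiner`, `Fragment` and the threshold parameter are hypothetical names, and the threshold of 100 used in the usage note is an assumption chosen to separate the two value groups seen in the excerpt (about +30 and about -288), not a value taken from any library.

```java
import java.util.List;

/** Hypothetical sketch: joins TJ fragments, adding a space only for wide gaps. */
class TjFragmentJoiner {

    /** One decoded string operand plus the number that follows it in the array. */
    static final class Fragment {
        final String text;
        final double adjustmentAfter; // thousandths of a text-space unit

        Fragment(String text, double adjustmentAfter) {
            this.text = text;
            this.adjustmentAfter = adjustmentAfter;
        }
    }

    /**
     * Concatenates the fragments. A space is inserted only when the adjustment
     * opens a gap of at least wordGapThreshold (an assumed cut-off).
     */
    static String join(List<Fragment> fragments, double wordGapThreshold) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < fragments.size(); i++) {
            Fragment f = fragments.get(i);
            out.append(f.text);
            boolean last = (i == fragments.size() - 1);
            // The number is subtracted from the position: a negative value moves
            // the next fragment right (a gap), a positive one pulls it closer.
            double gap = -f.adjustmentAfter;
            if (!last && gap >= wordGapThreshold) {
                out.append(' ');
            }
        }
        return out.toString();
    }
}
```

Fed with the eight decoded fragments from the excerpt and a threshold of 100, `join` returns `ele Motorradfahrer freuen sich`. In a real extraction strategy the same decision would more robustly be based on the distance between consecutive chunks compared with the width of a space in the current font, rather than on a fixed number.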
