I need to parse a PDF document. I already implemented the parser and used

Question

Asked: June 11, 2026 (2026-06-11T00:24:36+00:00)

I need to parse a PDF document. I already implemented the parser using the iText library, and until now it worked without any problems.

But now I need to parse another document, and the extracted text contains very strange whitespace in the middle of words. For example, I get:

Vo rber eitung auf die Motorr adsaison. Viele Motorr adf ahr er

The fragments should be joined into whole words (Vorbereitung, Motorradsaison, Motorradfahrer), but somehow the PDF parser inserts whitespace into the words. When I copy and paste the content from the PDF into a text file, I don't get these spaces.

At first I thought it was caused by the PDF parsing library I'm using, but I get exactly the same issue with another library.

I had a look at the singleSpaceWidth of the parsed words and noticed that it always changes exactly where a whitespace is added. I tried to join the fragments manually, but since there is no real pattern for recombining the words, that is almost impossible.

Did anyone else have a similar issue or even a solution to that problem?

As requested, here is some more information:

iText Version 5.2.1
http://prine.ch/whitespacesProblem.pdf (Link to the pdf)

Parsing with SemTextExtractionStrategy:

PdfReader reader = new PdfReader("data/SpecialTests/SuedostSchweiz/" + src);

SemTextExtractionStrategy semTextExtractionStrategy = new SemTextExtractionStrategy();

for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    // Set the page number on the strategy. Is used in the Parsing strategies.
    semTextExtractionStrategy.pageNumber = i;

    // Parse text from page
    PdfTextExtractor.getTextFromPage(reader, i, semTextExtractionStrategy);
}

Here is the method that SemTextExtractionStrategy calls to actually parse the text. In it I manually add a whitespace after every parsed word, but somehow the words are already split when they are detected:

@Override
public void parseText(TextRenderInfo renderInfo, int pageNumber) {      

    this.pageNumber = pageNumber;

    String text = renderInfo.getText();

    currTextBlock.getText().append(text + " ");

    ....
}

Here is the whole SemTextExtractionStrategy class, but all it does is call the method from above (parseText):

public class SemTextExtractionStrategy implements TextExtractionStrategy {

    // Text Extraction Strategies
    public ColumnDetecter columnDetecter = new ColumnDetecter();

    // Image Extraction Strategies
    public ImageRetriever imageRetriever = new ImageRetriever();

    public int pageNumber = -1;

    public ArrayList<TextParsingStrategy> textParsingStrategies = new ArrayList<TextParsingStrategy>();
    public ArrayList<ImageParsingStrategy> imageParsingStrategies = new ArrayList<ImageParsingStrategy>();

    public SemTextExtractionStrategy() {

        // Add all text parsing strategies which are later on applied on the extracted text
        // textParsingStrategies.add(fontSizeMatcher);
        textParsingStrategies.add(columnDetecter);

        // Add all image parsing strategies which are later on applied on the extracted text
        imageParsingStrategies.add(imageRetriever);
    }

    @Override
    public void beginTextBlock() {

    }

    @Override
    public void renderText(TextRenderInfo renderInfo) {
        // TEXT PARSING
        for(TextParsingStrategy strategy : textParsingStrategies) {
            strategy.parseText(renderInfo, pageNumber);
        }
    }

    @Override
    public void endTextBlock() {

    }

    @Override
    public void renderImage(ImageRenderInfo renderInfo) {
        for(ImageParsingStrategy strategy : imageParsingStrategies) {
            strategy.parseImage(renderInfo);
        }
    }
}
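For comparison with the unconditional `text + " "` in parseText, here is a minimal sketch of gap-based spacing: a space is emitted only when the horizontal gap between two consecutive chunks exceeds half the width of a space. The `Chunk` class and the coordinates are hypothetical stand-ins for values that could presumably be read from TextRenderInfo (baseline start and end points, getSingleSpaceWidth()); the sketch assumes a single line of text read left to right and has not been run against the PDF above.

```java
import java.util.Arrays;
import java.util.List;

class GapSpacing {
    // Hypothetical stand-in for the values read from one rendered text chunk.
    static final class Chunk {
        final String text;
        final double startX;           // x position where the chunk starts
        final double endX;             // x position where the chunk ends
        final double singleSpaceWidth; // width of a space in the chunk's font

        Chunk(String text, double startX, double endX, double singleSpaceWidth) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.singleSpaceWidth = singleSpaceWidth;
        }
    }

    // Joins chunks, adding a space only where the gap is wider than half a space.
    static String join(List<Chunk> chunks) {
        StringBuilder out = new StringBuilder();
        Chunk previous = null;
        for (Chunk current : chunks) {
            if (previous != null) {
                double gap = current.startX - previous.endX;
                if (gap > current.singleSpaceWidth / 2.0) {
                    out.append(' ');
                }
            }
            out.append(current.text);
            previous = current;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up coordinates: slightly negative gaps inside the word,
        // a gap of one full space width before the next word.
        List<Chunk> line = Arrays.asList(
                new Chunk("Motorr", 0.0, 30.0, 3.0),
                new Chunk("adf", 29.7, 45.0, 3.0),
                new Chunk("ahr", 44.7, 60.0, 3.0),
                new Chunk("er", 59.7, 69.0, 3.0),
                new Chunk("freuen", 72.0, 100.0, 3.0));
        System.out.println(join(line)); // prints "Motorradfahrer freuen"
    }
}
```

The half-space threshold is a judgment call; a document with tighter or looser word spacing may need a different fraction.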

1 Answer

Editorial Team · Answer 1 · 2026-06-11T00:24:37+00:00

I have processed the given PDF file with the following Ghostscript command:

gs -o out.pdf -q -sDEVICE=pdfwrite -dOptimize=false -dUseFlateCompression=false -dCompressPages=false -dCompressFonts=false whitespacesProblem.pdf

This command created a file out.pdf in which the streams are no longer compressed, so it is easier to read. The interesting part is in line 52 of that file, which I split into multiple lines for readability:

[
  (&;&)-287.988
  (672744)29.9906
  (+\(%)30.01
  (+!4)29.9876
  (&4)-287.989
  (%4)30.0039
  (&1&8)-287.975
  (3=\)!)-288.021
  (*&4)30.0212
  (&=23)-287.996
  (+1%)-287.99
  (\(=&)-288.011
  (8&1&)-287.974
  (672744)29.9906
  (+\(3+=378$)-250.977
  (#7\)!)
]TJ

The text characters are between the parentheses. I changed some of them and looked at the rendered PDF file to see which character represents which glyph. Then I decoded the text:

[
  (ele)-287.988
  (Motorr)29.9906 ***
  (adf)30.01 ***
  (ahr)29.9876 ***
  (er)-287.989
  (fr)30.0039
  (euen)-287.975
  (sich)-288.021
  ...
]
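The substitution behind this decoding can be written as a small lookup table. The sketch below is only an illustration: the table is read off the two listings above and covers just the characters in the decoded part, the class name `GlyphDecoder` is made up, and the unescaping handles only the backslash-before-parenthesis case seen here, not the full PDF string syntax.

```java
import java.util.HashMap;
import java.util.Map;

class GlyphDecoder {
    // Code-to-letter table taken from the raw and decoded listings;
    // codes[i] stands for letters[i].
    static final Map<Character, Character> TABLE = new HashMap<>();
    static {
        String codes   = "&;6724+(%!18=3)";
        String letters = "elMotradfhunisc";
        for (int i = 0; i < codes.length(); i++) {
            TABLE.put(codes.charAt(i), letters.charAt(i));
        }
    }

    // Decodes the content of one PDF string literal (without the outer
    // parentheses). Characters missing from the table become '?'.
    static String decode(String literal) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < literal.length(); i++) {
            char c = literal.charAt(i);
            if (c == '\\' && i + 1 < literal.length()) {
                // Simplified unescaping: take the next character literally.
                c = literal.charAt(++i);
            }
            out.append(TABLE.getOrDefault(c, '?'));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("672744")); // prints "Motorr"
        System.out.println(decode("+\\(%"));  // prints "adf"
        System.out.println(decode("3=\\)!")); // prints "sich"
    }
}
```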

So there are indeed position adjustments between the character groups. In your case the small ones (marked *** above) are probably the kerning of the font. The question is now how your PDF library interprets these adjustments, and it seems to me that even “negative whitespace” is turned into a space in the resulting string.
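To make the distinction concrete: in a TJ array each number is subtracted from the text position, in thousandths of a text space unit, so the values near -288 widen the gap (a word space) while the values near 30 pull the next group slightly closer. A minimal sketch of rebuilding the text on that basis follows; the class name `TjJoiner` and the cut-off of -200 are assumptions chosen for this sample only, not something iText or Ghostscript provides.

```java
import java.util.Arrays;
import java.util.List;

class TjJoiner {
    // Assumed cut-off for this sample: adjustments below it count as word gaps.
    static final double WORD_GAP_THRESHOLD = -200.0;

    // One TJ array element: a decoded string and the adjustment that follows it.
    static final class Piece {
        final String text;
        final double adjustment;

        Piece(String text, double adjustment) {
            this.text = text;
            this.adjustment = adjustment;
        }
    }

    // Concatenates the pieces, adding a space only after large negative adjustments.
    static String join(List<Piece> pieces) {
        StringBuilder out = new StringBuilder();
        for (Piece piece : pieces) {
            out.append(piece.text);
            if (piece.adjustment < WORD_GAP_THRESHOLD) {
                out.append(' ');
            }
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        // The decoded part of the array from above.
        List<Piece> pieces = Arrays.asList(
                new Piece("ele", -287.988),
                new Piece("Motorr", 29.9906),
                new Piece("adf", 30.01),
                new Piece("ahr", 29.9876),
                new Piece("er", -287.989),
                new Piece("fr", 30.0039),
                new Piece("euen", -287.975),
                new Piece("sich", -288.021));
        System.out.println(join(pieces)); // prints "ele Motorradfahrer freuen sich"
    }
}
```

A fixed cut-off works here because the two groups of values are far apart; other documents may need a threshold derived from the font's space width instead.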
