All the documentation I can find seems to suggest I can only extract the

Asked: May 23, 2026 (2026-05-23T00:35:43+00:00)

All the documentation I can find seems to suggest I can only extract the entire file’s content. But I need to extract pages individually. Do I need to write my own parser for that? Is there some obvious method that I am missing?

1 Answer

Editorial Team · Answer 1 · 2026-05-23T00:35:44+00:00

Tika does mark page boundaries (at least for PDF): the parser sends the elements <div><p> before each page starts and </p></div> after it ends. You can keep a page count in your own handler by watching for these elements (the handler here counts pages using only <p>):

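import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public abstract class MyContentHandler implements ContentHandler {
    // Element whose start/end events are treated as page boundaries.
    private String pageTag = "p";
    protected int pageNumber = 0;
    ...
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
        if (pageTag.equals(qName)) {
            startPage();
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (pageTag.equals(qName)) {
            endPage();
        }
    }

    // Called when a page opens; afterwards pageNumber is the 1-based number of the current page.
    protected void startPage() throws SAXException {
        pageNumber++;
    }

    // Called when a page closes; override to flush whatever was collected for the page.
    protected void endPage() throws SAXException {
    }
    ...
}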

With PDFs you may find that the parser does not send text lines in the proper order. See "Extracting text from PDF files with Apache Tika 0.9 (and PDFBox under the hood)" for how to handle this.
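To go from counting pages to actually collecting each page's text, a handler along the following lines might work. This is only a sketch under stated assumptions: the class name PageTextCollector and the helper extractPages are made up for illustration, and it assumes each page is wrapped in <div class="page">, so check the events your Tika version really emits before relying on it. It keys on the <div> rather than the <p> so that pages containing several paragraphs are still counted once. To keep the example runnable without Tika on the classpath, it feeds a hand-written XHTML sample through the JDK's own SAX parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of each page separately.
// Assumption: every page is wrapped in <div class="page"> ... </div>.
public class PageTextCollector extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private StringBuilder current; // null while we are outside a page
    private int divDepth;          // nesting depth of <div> within the current page

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"div".equals(qName)) {
            return;
        }
        if (current == null && "page".equals(atts.getValue("class"))) {
            current = new StringBuilder(); // a new page begins
            divDepth = 1;
        } else if (current != null) {
            divDepth++; // nested <div> inside a page
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && "div".equals(qName) && --divDepth == 0) {
            pages.add(current.toString().trim()); // the page's own <div> closed
            current = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    public List<String> getPages() {
        return pages;
    }

    // Illustration only: parses an XHTML string with the JDK SAX parser.
    public static List<String> extractPages(String xhtml) throws Exception {
        PageTextCollector handler = new PageTextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getPages();
    }

    public static void main(String[] args) throws Exception {
        String sample = "<html><body>"
                + "<div class=\"page\"><p>First page text</p></div>"
                + "<div class=\"page\"><p>Second page text</p></div>"
                + "</body></html>";
        List<String> pages = extractPages(sample);
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("Page " + (i + 1) + ": " + pages.get(i));
        }
    }
}
```

With Tika itself you would presumably skip extractPages and pass an instance of the handler as the ContentHandler argument of the parser's parse method, then read getPages() once parsing finishes.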
