All the documentation I can find seems to suggest I can only extract the entire file’s content. But I need to extract pages individually. Do I need to write my own parser for that? Is there some obvious method that I am missing?
Actually, Tika does handle pages (at least in PDF) by sending the elements <div><p> before a page starts and </p></div> after a page ends. You can easily set up a page count in your handler using this, just by counting the <p> elements.
When doing this with PDF, you may run into a problem where the parser doesn’t send text lines in the proper order. See "Extracting text from PDF files with Apache Tika 0.9 (and PDFBox under the hood)" on how to handle this.
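The page-counting idea can be sketched roughly as follows. This is a minimal illustration, not the answer's original snippet: the class name PageSplittingHandler, the helper splitPages, and the sample XHTML string are all made up for this example, and it assumes that each page really arrives wrapped in <div><p> ... </p></div> as described above. To stay self-contained it feeds a hand-written XHTML string through the JDK's own SAX parser instead of calling Tika.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical handler: starts a new page every time the configured
 * page tag opens and collects the character data seen after it.
 */
class PageSplittingHandler extends DefaultHandler {
    private final String pageTag;
    private final List<StringBuilder> pages = new ArrayList<>();

    PageSplittingHandler(String pageTag) {
        this.pageTag = pageTag;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Namespace-aware parsers fill localName; others only fill qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if (pageTag.equals(name)) {
            pages.add(new StringBuilder()); // a new page begins here
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text seen before the first page tag is ignored.
        if (!pages.isEmpty()) {
            pages.get(pages.size() - 1).append(ch, start, length);
        }
    }

    int pageCount() {
        return pages.size();
    }

    List<String> pageTexts() {
        List<String> out = new ArrayList<>();
        for (StringBuilder sb : pages) {
            out.add(sb.toString().trim());
        }
        return out;
    }

    /** Parses an XHTML string with the JDK SAX parser, assuming <p> marks a page. */
    static List<String> splitPages(String xhtml) throws Exception {
        PageSplittingHandler handler = new PageSplittingHandler("p");
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.pageTexts();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the events Tika would send; not real Tika output.
        String xhtml = "<html><body>"
                + "<div><p>first page</p></div>"
                + "<div><p>second page</p></div>"
                + "</body></html>";
        List<String> pages = splitPages(xhtml);
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("Page " + (i + 1) + ": " + pages.get(i));
        }
    }
}
```

With Tika itself, the same handler would presumably be passed as the ContentHandler argument of Parser.parse(InputStream, ContentHandler, Metadata, ParseContext), for instance on an AutoDetectParser. The markup that marks a page may differ between Tika versions and document types (some releases appear to use <div class="page"> with several <p> elements inside), so it is worth dumping the raw XHTML once and adjusting the page tag check before relying on the count.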