I want to read PDF files using Hadoop. How is this possible?


Asked: May 29, 2026 (2026-05-29T21:13:25+00:00)


I want to read PDF files using Hadoop. How is this possible?
As far as I know, Hadoop can only process text files, so is there any way to parse PDF files into text?

Please give me some suggestions.


1 Answer

Editorial Team · Answer 1 · 2026-05-29T21:13:27+00:00


An easy way is to pack the PDF files into a SequenceFile. A SequenceFile is a binary file format made up of key/value records, so Hadoop is not limited to text input. You could make each record in the SequenceFile one PDF. To do this, create a class that implements Writable and holds the PDF bytes plus any metadata you need. Inside your job you can then use any Java PDF library, such as PDFBox, to manipulate the PDFs, for example to extract their text.
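As a rough illustration of the record class described above, here is a minimal sketch. The class name `PdfRecord`, its fields, and the `toBytes`/`fromBytes` helpers are hypothetical and not taken from any library. To keep the sketch runnable without Hadoop on the classpath, it only uses `java.io`; the `write(DataOutput)` and `readFields(DataInput)` methods follow the contract of Hadoop's `Writable` interface, so in a real job you would presumably declare the class as `implements org.apache.hadoop.io.Writable`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record type: one PDF file plus its name as metadata.
// Assumption: in a real Hadoop job this would be declared as
// "implements org.apache.hadoop.io.Writable".
class PdfRecord {
    private String fileName = "";
    private byte[] content = new byte[0];

    // Writable implementations need a no-argument constructor,
    // because the framework creates instances before calling readFields.
    PdfRecord() {}

    PdfRecord(String fileName, byte[] content) {
        this.fileName = fileName;
        this.content = content.clone();
    }

    String getFileName() { return fileName; }

    byte[] getContent() { return content.clone(); }

    // Serialize the file name, then the length-prefixed raw PDF bytes.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(fileName);
        out.writeInt(content.length);
        out.write(content);
    }

    // Deserialize the fields in the same order they were written.
    public void readFields(DataInput in) throws IOException {
        fileName = in.readUTF();
        content = new byte[in.readInt()];
        in.readFully(content);
    }

    // Illustration-only helper: serialize this record to a byte array.
    byte[] toBytes() {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Illustration-only helper: rebuild a record from serialized bytes.
    static PdfRecord fromBytes(byte[] data) {
        try {
            PdfRecord record = new PdfRecord();
            record.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            return record;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a mapper, the bytes returned by `getContent()` could then be handed to a PDF library such as PDFBox. The file name is written with `writeUTF`, which is limited to 65535 encoded bytes; that is fine for names but is the reason the PDF body is stored as a length-prefixed byte array instead.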
