I am looking at a parser for pdf and MS office document formats to

Question

Asked: June 14, 2026 (2026-06-14T21:31:36+00:00)

I am looking for a parser for PDF and MS Office document formats to extract tabular information from files. I was thinking of writing separate implementations when I came across Apache Tika. I am able to extract the full text from any of these file formats, but my requirement is to extract tabular data, where I expect two columns in a key-value format. I checked most of what is available on the net for a solution but could not find any.
Any pointers for this?

1 Answer

Editorial Team · Answer 1 · 2026-06-14T21:31:37+00:00

Well, I went ahead and implemented it separately using Apache POI for the MS formats, and came back to Tika for PDF. Tika outputs the document content as “SAX based XHTML events”.

So basically we can write a custom SAX implementation to parse the file.

The structured text output will be of the form (metadata omitted):

<body><div class="page"><p/>
<p>Key1 Value1 </p>
<p>Key2 Value2 </p>
<p>Key3 Value3</p>
<p/>
</div>
</body>

In our SAX implementation we can treat the first part of each paragraph as the key. For my problem I already know the keys and am only looking for the values, so the value is simply the substring that follows the key.

Override public void characters(char[] ch, int start, int length) with this logic.

Please note that in my case the structure of the content is fixed and I know the keys that are coming in, so it was easy to do it this way. This is not a generic solution.
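
As a rough illustration of the approach described above, here is a minimal sketch of such a handler. It is not the original code: the class name KeyValueHandler, the extract helper and the idea of passing the known keys in as a list are all assumptions made for the example. One deliberate difference from the description: SAX may deliver the text of a single paragraph in several characters() calls, so the sketch only buffers text in characters() and does the key/value split in endElement(). To stay self-contained it uses the JDK's own SAX parser on the sample markup instead of Tika.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects "Key Value" paragraphs for a fixed set of known keys.
class KeyValueHandler extends DefaultHandler {
    private final List<String> knownKeys;                 // assumption: keys are known up front
    private final Map<String, String> values = new LinkedHashMap<>();
    private final StringBuilder paragraph = new StringBuilder();
    private boolean inParagraph = false;

    KeyValueHandler(List<String> knownKeys) {
        this.knownKeys = knownKeys;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("p".equals(qName)) {
            inParagraph = true;
            paragraph.setLength(0);                       // start a fresh buffer for this <p>
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called more than once per paragraph, so only accumulate here.
        if (inParagraph) {
            paragraph.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!"p".equals(qName)) {
            return;
        }
        inParagraph = false;
        String text = paragraph.toString().trim();
        for (String key : knownKeys) {
            // The value is the substring that follows the known key.
            if (text.startsWith(key + " ")) {
                values.put(key, text.substring(key.length()).trim());
                break;
            }
        }
    }

    Map<String, String> getValues() {
        return values;
    }

    // Convenience helper for trying the handler on an XHTML string without Tika.
    static Map<String, String> extract(String xhtml, List<String> keys) throws Exception {
        KeyValueHandler handler = new KeyValueHandler(keys);
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.getValues();
    }
}
```

Calling KeyValueHandler.extract on the sample markup above with the keys Key1, Key2 and Key3 should give a map of Key1 to Value1, Key2 to Value2 and Key3 to Value3. With Tika, the same handler would presumably be passed as the ContentHandler argument of Parser.parse(InputStream, ContentHandler, Metadata, ParseContext) instead of going through extract.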
