I am looking for a parser for PDF and MS Office document formats to extract tabular information from files. I was thinking of writing separate implementations when I came across Apache Tika. I am able to extract the full text from any of these file formats, but my requirement is to extract tabular data, where I am expecting two columns in key-value format. I checked most of what is available on the net for a solution but could not find any.
Any pointers for this?
Well, I went ahead and implemented it separately using Apache POI for the MS formats, and came back to Tika for PDF. What Tika does with a document is output it as “SAX based XHTML events” [1].
So basically we can write a custom SAX implementation to parse the file.
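As a rough illustration of what a custom SAX implementation looks like, here is a minimal sketch of a handler that collects the text of each paragraph element. The class name `ParagraphCollector` and the helper `collect` are made up for this example, and the JDK's own SAX parser is used on a hand-written XHTML string as a stand-in for Tika so that the sketch has no external dependencies. It also assumes that each row of interest arrives inside a `<p>` element, which may not hold for every PDF.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects the text of each <p> element from SAX events. */
class ParagraphCollector extends DefaultHandler {
    private final List<String> paragraphs = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <p>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("p".equals(qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may deliver one text node in several chunks, so append rather than assign.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(qName) && current != null) {
            paragraphs.add(current.toString().trim());
            current = null;
        }
    }

    public List<String> getParagraphs() {
        return paragraphs;
    }

    /** Stand-in for Tika: replays a hand-written XHTML string as SAX events. */
    public static List<String> collect(String xhtml) {
        ParagraphCollector handler = new ParagraphCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XHTML", e);
        }
        return handler.getParagraphs();
    }
}
```

Since `DefaultHandler` implements `org.xml.sax.ContentHandler`, an instance of a class like this should be usable as the handler argument of Tika's `Parser.parse(InputStream, ContentHandler, Metadata, ParseContext)` in place of the JDK parser used here.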
The structured text output will be of the form (metadata omitted)
In our SAX implementation we can treat the first part of each line as the key (in my case I already know the keys and only need the values, so the value is simply the substring that follows the key).
Override public void characters(char[] ch, int start, int length) with the logic
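The original logic is not reproduced here, so the following is only a sketch of what such an override might look like. It assumes the keys are known in advance, that each key and its value arrive together in a single `characters` call, and that no key is a prefix of another. The class name `KnownKeyHandler` and the sample keys in the usage note are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: extracts values for a fixed set of known keys. */
class KnownKeyHandler extends DefaultHandler {
    private final List<String> knownKeys; // supplied by the caller
    private final Map<String, String> values = new LinkedHashMap<>();

    public KnownKeyHandler(List<String> knownKeys) {
        this.knownKeys = knownKeys;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Assumption: the whole "key value" row arrives in one call.
        String text = new String(ch, start, length).trim();
        for (String key : knownKeys) {
            if (text.startsWith(key)) {
                // The key is the leading part; the value is the remaining substring.
                values.put(key, text.substring(key.length()).trim());
                break; // first matching key wins
            }
        }
    }

    public Map<String, String> getValues() {
        return values;
    }
}
```

For example, with the keys "Invoice No" and "Date", a text event containing "Invoice No 12345" would store "12345" under "Invoice No", while text that starts with neither key is ignored. If the parser splits a row across several `characters` calls, the text would need to be buffered until the end of the enclosing element before matching.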
Please note that in my case the structure of the content is fixed and I know the keys in advance, so it was easy to do it this way. This is not a generic solution.