I have tons PDFs that I need to convert to some structured format that I can interpret (HTML/XML/etc)
PDFs are in this format:
http://img840.imageshack.us/img840/5407/pdfv.png
I have tried so far a lot of softwares that convert to HTML but all of them have no capabilities to separate the images, they just take like a printscreen of the page without the text and then use this image as a background in the html, using css to position the text
Like this: http://img37.imageshack.us/img37/5015/examplelp.jpg
I have a bunch of PDFs so process each ones images manually is not an option. Does anyone knows any solution for this (even paid softwares)?
I had a similar problem a while back and ended up writing my own solution. It’s called PDFX and it’s free to use. It converts PDF to a structured-format XML and also renders any bitmap images (not vector graphics) found in the PDF separately.
Example input/output can be found here. You might want to give it a try.