Possible Duplicate:
solution to convert PDFs, DOCs, DOCXs into a textual format with python
I am making a document search engine which indexes popular binary formats. I am looking for python libraries for this purpose.
Reliable converters have proved hard to find. PyPDF never works accurately. Please recommend:
- python libraries that convert these formats to text
- or cross-platform, standalone programs that can be called as a subprocess
You can read .docx by unzipping it and then rootling around in the resulting folder structure. See "How can I search a word in a Word 2007 .docx file?".
.doc is probably the hardest. Is COM scripting an option for you? That is, asking Word to open the file and export it as text? There's also a Linux utility; see "extracting text from MS word files in python".
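For the .docx route, here is a minimal sketch of the "unzip and look inside" idea using only the standard library. It assumes the main body text lives in `word/document.xml` inside the archive, which is the usual layout for Word 2007+ files. The function name `docx_to_text` is made up for illustration. Headers, footers, footnotes and comments sit in other parts of the archive and are ignored here, and text inside nested containers such as text boxes may come out duplicated.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph.

    `path` may be a filename or a binary file-like object.
    Hypothetical helper: a sketch, not a complete converter.
    """
    # A .docx file is a zip archive; the main body is word/document.xml
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # <w:p> is a paragraph; <w:t> elements hold the actual text runs
    for para in root.iter(W_NS + "p"):
        runs = [node.text for node in para.iter(W_NS + "t") if node.text]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Calling `docx_to_text("report.docx")` (the filename is just an example) should give you a plain string you can feed to your indexer. This does nothing for .doc or PDF, which still need an external tool.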