Possible Duplicate:
extracting text from MS word files in python
I want to parse a .doc file with a Python script in order to search it with an expression. The script runs on a Unix machine.
Can anyone help?
You may take a look at this project: python-docx.
After downloading the library, you can run
    python example-extracttext.py docfile.docx textfile.txt | grep some-expression
in the shell. You can also do more sophisticated searches in Python code when necessary.

The shortcoming of python-docx is that it currently supports only MS Word 2007/2008 files. If that concerns you, I recommend antiword, which supports Microsoft Word versions 2, 6, 7, 97, 2000, 2002 and 2003. I've been using it in my vimrc to view MS Word files in the Vim editor. Although it's not a Python script, it can easily be invoked from Python.
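As a rough illustration of invoking antiword from Python, here is a minimal sketch. It assumes antiword is installed and on your PATH, and that running `antiword file.doc` prints the document text to standard output. The helper names `doc_to_text` and `search_lines` are made up for this example and are not part of any library.

```python
import re
import subprocess
import sys


def doc_to_text(path, command="antiword"):
    """Run `command path` and return whatever it prints to stdout.

    Assumption: `command` (antiword by default) is installed and on PATH.
    Raises subprocess.CalledProcessError if the command exits with an error.
    """
    result = subprocess.run(
        [command, path], capture_output=True, text=True, check=True
    )
    return result.stdout


def search_lines(text, pattern):
    """Return (line_number, line) pairs for lines matching the regex pattern."""
    regex = re.compile(pattern)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]


# Only runs when called as: python thisfile.py docfile.doc some-expression
if __name__ == "__main__" and len(sys.argv) == 3:
    doc_path, expression = sys.argv[1], sys.argv[2]
    for number, line in search_lines(doc_to_text(doc_path), expression):
        print(f"{number}: {line}")
```

Keeping the extraction and the search in separate functions means the search part works on any text, so you could swap antiword for another converter without touching the matching logic.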