What’s a reliable way to automatically count the characters and/or words in a .doc or .docx file?
The only real requirement is a reasonably accurate and reasonably reliable count.
It needs to work with documents containing something other than Latin script, so counting characters is good enough for most cases.
The count does not necessarily need to match Word’s, but the closer the better.
Since there are a gazillion different apps that can generate .doc files, it’s okay to fail to count anything, but this case needs to be catchable so we’re aware that a count may be inaccurate. For all other cases the count must be, say, at least 99% accurate at least 99% of the time.
I’m open as to the technologies involved, but something that can run on a *NIX command line would be greatly preferred.
Is there a reasonable solution for this?
Here’s a link to some Linux word-to-text converters.
For example, you could use one of them to convert the document to plain text and then run “wc” on the result to do the counting.
Edit:
This link shows that AbiWord has a command-line interface, which you could use to convert the .docx file to .txt and then count the words using “wc”. AbiWord does support the .docx format.
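
To make the convert-then-count idea concrete, here is a minimal sketch in Python. It assumes AbiWord is installed and that your build accepts the `--to` and `--to-name` command-line options; the names `ConversionError`, `docx_to_text` and `count_text` are made up for this illustration. Any conversion failure is turned into one catchable exception, so a missing count is never mistaken for a real one.

```python
import subprocess
import tempfile
from pathlib import Path


class ConversionError(Exception):
    """Raised when the document could not be converted to plain text."""


def count_text(text):
    """Return (characters, words) for already-extracted plain text.

    Words are whitespace-separated runs (what `wc -w` counts);
    characters are counted without whitespace.
    """
    words = text.split()
    chars = sum(len(word) for word in words)
    return chars, len(words)


def docx_to_text(path, timeout=60):
    """Convert a .doc/.docx file to text via AbiWord's command line.

    Assumption: the `abiword` binary is on PATH and supports
    `--to=txt` and `--to-name=FILE`.
    """
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "out.txt"
        try:
            subprocess.run(
                ["abiword", "--to=txt", f"--to-name={out}", str(path)],
                check=True,
                capture_output=True,
                timeout=timeout,
            )
            return out.read_text(encoding="utf-8", errors="replace")
        except (OSError, subprocess.SubprocessError) as exc:
            # Covers a missing binary, a non-zero exit, a timeout,
            # and an output file that was never written.
            raise ConversionError(f"could not convert {path}: {exc}") from exc
```

You would call `count_text(docx_to_text("report.docx"))` inside a `try` block and treat `ConversionError` as "count unavailable". Note that whitespace-based word counts are not meaningful for scripts written without spaces between words, so for those documents the character count is the number to rely on.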