I am looking for an efficient way to read the raw text from any MS Office document (Word, Excel, or PowerPoint) and then display a list of distinct words with a count of how many times each word is used. If possible, I would like to be able to exclude common words (‘and’, ‘to’, ‘the’, etc.).
What is the best way to achieve this in C#?
You should look into Lucene.NET. It can build word indexes from a variety of sources, including, I believe, Word documents.
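Whichever tool ends up extracting the text, the counting step itself is small. Here is a minimal sketch of that step in plain C# with LINQ, assuming the raw text has already been pulled out of the document as a string. The class name WordCounter, the method CountWords, and the short stop-word list are made up for illustration and are not part of Lucene.NET or any Office API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordCounter
{
    // Hypothetical stop-word list for illustration; extend it to suit your data.
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "and", "to", "the", "a", "an", "of", "in", "is", "it"
        };

    // Returns each distinct word (lower-cased) with the number of times it occurs.
    // Assumes 'text' is raw text that was extracted from the document elsewhere.
    public static Dictionary<string, int> CountWords(string text)
    {
        // A word here is a run of letters, optionally with internal apostrophes ("don't").
        return Regex.Matches(text ?? string.Empty, @"\p{L}+(?:'\p{L}+)*")
            .Cast<Match>()
            .Select(m => m.Value.ToLowerInvariant())
            .Where(w => !StopWords.Contains(w))
            .GroupBy(w => w)
            .ToDictionary(g => g.Key, g => g.Count());
    }
}
```

To display the list, order the result by count, for example with counts.OrderByDescending(p => p.Value), and write out each Key and Value. Numbers are ignored by this pattern; widen the regular expression if you need them counted as words.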