How do I data mine a pile of text to get keywords by usage? (“Jacob Smith” or “fence”)
Is there software that already does this, even semi-automatically? If it could filter out simple words like “the”, “and”, and “or”, I could get to the topics more quickly.
The general algorithm is going to go like this:
- Obtain text
- Strip punctuation, special characters, etc.
- Strip "simple" words
- Split on spaces
- Loop over the split text
- Add each word to an array/hash table/etc. if it doesn't exist; if it does, increment the counter for that word

The end result is a frequency count of all words in the text. You can then take these values and divide by the total number of words to get a percentage of frequency. Any further processing is up to you.
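The steps above could be sketched in Python roughly as follows. This is only an illustration under some assumptions: the `STOP_WORDS` set is a tiny made-up list (a real one would be much longer), the function names `count_words` and `relative_frequencies` are hypothetical, and the "simple" words are filtered after splitting rather than before, because that is easier to do on a list of words. The frequencies here are relative to the words that remain after filtering.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list for illustration only.
STOP_WORDS = {"the", "and", "or", "a", "an", "of", "to", "in", "is", "it"}


def count_words(text, stop_words=STOP_WORDS):
    """Return a Counter mapping each remaining word to its number of occurrences."""
    # Lowercase, then turn anything that is not a letter, digit or
    # whitespace into a space (this strips punctuation and special characters).
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    # Split on whitespace and drop the "simple" words.
    words = [w for w in cleaned.split() if w not in stop_words]
    # Counter adds a word on first sight and increments it afterwards.
    return Counter(words)


def relative_frequencies(counts):
    """Divide each count by the total number of counted words."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


if __name__ == "__main__":
    sample = "The fence and the gate. Jacob Smith painted the fence!"
    counts = count_words(sample)
    for word, n in counts.most_common(3):
        print(word, n)
```

Sorting by count (here via `most_common`) is what surfaces the likely keywords; multiplying the relative frequencies by 100 gives the percentages mentioned above.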
You’re also going to want to look into stemming, which reduces words to their root. For example: going => go, cars => car, etc.
An algorithm like this is common in spam filters, keyword indexing, and the like.
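To make the stemming idea concrete, here is a toy sketch in Python. It is not a real stemming algorithm: `naive_stem` is a made-up helper that only strips a trailing "ing" or "s", so it will mangle many words (for instance "thing" becomes "th"). An established stemmer such as the Porter algorithm handles far more cases. The point is only to show how counts for different forms of a word can be merged under one root.

```python
from collections import Counter


def naive_stem(word):
    """Toy suffix stripper for illustration; NOT a real stemming algorithm."""
    # "going" -> "go"
    if word.endswith("ing") and len(word) > 4:
        return word[:-3]
    # "cars" -> "car", but leave words like "glass" alone
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def merge_by_stem(counts):
    """Combine the counts of all words that share the same (naive) stem."""
    merged = Counter()
    for word, n in counts.items():
        merged[naive_stem(word)] += n
    return merged


if __name__ == "__main__":
    counts = Counter({"car": 1, "cars": 2, "going": 1})
    print(merge_by_stem(counts))
```

Applying the stemmer before counting, or merging afterwards as shown here, keeps "car" and "cars" from being reported as two separate keywords.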