I’d like to get some community consensus on a good design for storing and querying word frequency counts. I’m building an application that parses text inputs and stores how many times each word has appeared (over time). So given the following inputs:
- “To Kill a Mocking Bird”
- “Mocking a piano player”
the application would store the following values:
Word      Count
---------------
To        1
Kill      1
A         2
Mocking   2
Bird      1
Piano     1
Player    1
And later be able to quickly query for the count value of a given arbitrary word.
My current plan is to simply store the words and counts in a database, and rely on caching word count values … But I suspect that I won’t get enough cache hits to make this a viable solution long term.
Can anyone suggest algorithms, or data structures, or any other idea that might make this a well-performing solution?
I don’t understand why you feel a database would not be a suitable solution. You will probably only have about 100,000 rows, and a table that small can be held entirely in memory. Make the word the primary key and each lookup becomes a single index lookup, which will be very fast.
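
To make that concrete, here is a minimal sketch of the word-as-primary-key approach using Python's built-in sqlite3 module. The table name `word_counts`, the column `occurrences`, and the helpers `open_store`, `add_text` and `get_count` are all made up for illustration, and the tokenizing (lowercase, letters and apostrophes only) is an assumption rather than anything the question specifies.

```python
import re
import sqlite3

def open_store(path=":memory:"):
    """Open a database and create the (hypothetical) word_counts table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_counts ("
        "word TEXT PRIMARY KEY, occurrences INTEGER NOT NULL)"
    )
    return conn

def add_text(conn, text):
    """Split text into lowercase words and increment each word's count."""
    # Assumed tokenizer: runs of letters/apostrophes, case-insensitive.
    words = re.findall(r"[a-z']+", text.lower())
    with conn:  # one transaction per input
        for w in words:
            # Create the row if it is missing, then bump the counter.
            conn.execute(
                "INSERT OR IGNORE INTO word_counts (word, occurrences) VALUES (?, 0)",
                (w,),
            )
            conn.execute(
                "UPDATE word_counts SET occurrences = occurrences + 1 WHERE word = ?",
                (w,),
            )

def get_count(conn, word):
    """Primary-key lookup; returns 0 for words never seen."""
    row = conn.execute(
        "SELECT occurrences FROM word_counts WHERE word = ?", (word.lower(),)
    ).fetchone()
    return row[0] if row else 0

if __name__ == "__main__":
    conn = open_store()
    add_text(conn, "To Kill a Mocking Bird")
    add_text(conn, "Mocking a piano player")
    print(get_count(conn, "mocking"), get_count(conn, "bird"))
```

Because the word is the primary key, `get_count` is answered from the key's index rather than by scanning the table. Whether a separate cache is still worth adding on top is something you would have to measure.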