I have a database with 2,000,000 messages. When a user receives a message, I need to find relevant messages in my database based on the occurrence of words.
I have tried running a batch process to summarize my database:
1 – Store all words (except an, a, the, of, for…) of all messages.
2 – Create an association between each message and the words it contains (I also store how often each word appears in the message).
Then, when I receive a message:
1 – I parse its words (the same way as in the first step of my batch process).
2 – I query the database to fetch messages sorted by the number of words they have in common with it.
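The two phases above can be sketched roughly as follows. This is only an illustration: in-memory dictionaries stand in for my database tables, and the names (`tokenize`, `MessageIndex`, `similar`) and the short stop-word list are hypothetical, not my actual schema or code.

```python
import re
from collections import Counter, defaultdict

# Assumption: a tiny stop-word list; the real list is longer.
STOP_WORDS = {"a", "an", "the", "of", "for"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOP_WORDS]


class MessageIndex:
    """Inverted index: word -> {message_id: frequency in that message}."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add(self, message_id, text):
        # Batch step: store each word with its frequency for this message.
        for word, freq in Counter(tokenize(text)).items():
            self.postings[word][message_id] = freq

    def similar(self, text, limit=10):
        # Query step: count how many distinct words each stored message
        # shares with the incoming one, then sort by that count.
        shared = Counter()
        for word in set(tokenize(text)):
            for message_id in self.postings.get(word, ()):
                shared[message_id] += 1
        # Highest count first; ties broken by message id for stable output.
        return sorted(shared.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


index = MessageIndex()
index.add(1, "The quick brown fox")
index.add(2, "A quick brown dog jumps")
index.add(3, "Database tuning for the index")
print(index.similar("quick brown fox"))  # [(1, 3), (2, 2)]
```

The query only touches the posting lists of the words in the incoming message, so its cost depends on how common those words are rather than on the total number of messages.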
However, both updating my word base and querying for similar messages are very heavy and slow. The word-base update takes ~1.2111 seconds for a 3000-byte message. The similar-messages query takes ~9.8 seconds for a message of the same size.
Database tuning has already been done, and the code works correctly.
I need a better algorithm for this.
Any ideas?
I would recommend setting up Apache Solr (http://lucene.apache.org/solr/). It is very easy to set up and can index millions of documents. Solr handles all the necessary optimization (although it is open source, so you can tweak it if you feel you need to).
You can then query using the available APIs; I prefer the Java API, SolrJ (http://wiki.apache.org/solr/Solrj). I typically see results returned in under one second.
Solr typically outperforms MySQL for text indexing.
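Besides SolrJ, Solr can also be queried over plain HTTP. Below is a minimal sketch in Python, assuming a Solr instance on the default port 8983, a core named `messages`, and a text field named `body`; the core name, the field name, and the helper functions are hypothetical and would need to match your own setup. It ORs the words of the incoming message together and lets Solr rank the matches by its relevance score.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_select_url(base_url, core, field, words, rows=10):
    """Build a URL for Solr's standard /select handler.

    Assumes the words are already tokenized and contain no quote characters.
    """
    query = " OR ".join(f'{field}:"{w}"' for w in words)
    params = {"q": query, "rows": rows, "fl": "id,score", "wt": "json"}
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


def find_similar(base_url, core, field, words, rows=10):
    """Run the query and return the matching documents (id and score)."""
    with urlopen(build_select_url(base_url, core, field, words, rows)) as resp:
        return json.load(resp)["response"]["docs"]


# Hypothetical usage; requires a running Solr with the assumed core and field:
# docs = find_similar("http://localhost:8983", "messages", "body",
#                     ["database", "index"])
```

With this approach Solr does the ranking, so there is no need to maintain the word tables or the counting query by hand.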