I am using these technologies: SQL Server 2005, ASP.NET MVC, NHibernate/Sharp Architecture, and would like to mine some text with the final aim of presenting some web-based stats. I have several million keywords and several million documents, and would like to run queries against these documents, indexed by the keywords. I have played a bit with SQL Server’s full-text indexing but I am not too impressed. So I am wondering whether Lucene.Net might be an alternative.
I have never used Lucene.Net but understand that it is a 1:1 port of the Java version. So my first question is: provided that Lucene is the right technology, is it worth studying the book ‘Lucene in Action’?
Thanks.
Best wishes,
Christian
Well,
FIRST: update SQL Server. You are using a version that is two generations out of date, and its full-text search is an older implementation with many known shortcomings that have since been fixed.
Second: Lucene may really be better suited. SQL Server is primarily a database server; its full-text search does a lot of things, but it also has a lot of limitations.
But bringing in Lucene DOES add a significant complication: transactions and backup handling get a lot more complicated, because you now have two systems to keep in sync. SQL Server 2008 R2 does a much better job here (the full-text index is stored in the database files).
That said, also be careful with performance. You may need a QUITE high-end server if you want to run a lot of queries in parallel (which can happen easily with a web application). This may require multiple database servers running as read-only replicas, something SQL Server does a lot more easily than Lucene (as in: out of the box).
I suggest you just get Lucene and play with it 😉 Not a lot more needed.
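If you want a feel for what "documents indexed by keywords" means before installing anything, here is a toy sketch in Python. To be clear, this is NOT Lucene's API and nothing like its real implementation; the function names and sample documents are made up for illustration. It only shows the basic idea of an inverted index: map each keyword to the set of documents containing it, then answer an AND query by intersecting those sets.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_all(index, keywords):
    """Return sorted ids of documents containing every keyword (AND query)."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical sample data, just for illustration.
documents = {
    1: "full text search in SQL Server",
    2: "Lucene is a text search library",
    3: "NHibernate maps objects to SQL tables",
}
index = build_index(documents)
print(search_all(index, ["text", "search"]))
```

A real engine adds tokenizing, stemming, ranking and on-disk storage on top of this, which is where several million documents start to matter.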