I’m trying to figure out the best practice for implementing a complex algorithm on stored information in a relational DB.
Specifically: I want to implement a variation of the k-means algorithm (a clustering algorithm, used here to cluster documents) on a large MS SQL Server database containing TFxIDF vectors of many documents (these vectors are used as input for the algorithm).
My first thought was doing the entire thing in SQL using stored procedures, functions, views and all the other basic SQL Server tools, but then I thought maybe I should write managed code (I’m fluent in C#) that will be executed on the SQL Server.
Performance is an issue here, so I need to take that into consideration as well.
I would appreciate any advice on the path I should take.
Thank you!
Performance always is an issue. When looking at this kind of code, there are two opposing trends that you have to consider:
On the other hand:
Take these two points together, and the best course for performance is typically to use the querying capabilities in the database to pull down just the subset of records that you really need, and maybe do some of the easier pre-processing — the low-hanging fruit, if you will. Then finish the heavy lifting on an application server, in parallel if possible.
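To make that split concrete, here is a minimal sketch of the idea, not a tuned implementation. It is written in Python with the standard-library `sqlite3` module standing in for SQL Server so that it runs anywhere. The table `tfidf(doc_id, term_id, weight)`, the `min_weight` filter, the naive seeding, and all function names are assumptions made for illustration; none of them come from the question. The query does the cheap filtering in the database, and the k-means loop (cosine similarity over sparse vectors) runs in application code.

```python
import sqlite3
from collections import defaultdict
from math import sqrt


def load_vectors(conn, min_weight=0.0):
    """Let the database do the filtering; return {doc_id: {term_id: weight}}.

    Assumes a hypothetical table tfidf(doc_id, term_id, weight).
    """
    rows = conn.execute(
        "SELECT doc_id, term_id, weight FROM tfidf WHERE weight > ?",
        (min_weight,),
    )
    vectors = defaultdict(dict)
    for doc_id, term_id, weight in rows:
        vectors[doc_id][term_id] = weight
    return dict(vectors)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = sqrt(sum(w * w for w in a.values())) * sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def nearest(vec, centroids):
    """Index of the centroid most similar to vec."""
    return max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))


def mean(vecs):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for v in vecs:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vecs) for t, w in total.items()}


def kmeans(vectors, k, iterations=10):
    """Plain k-means on the application side; returns {doc_id: cluster index}."""
    doc_ids = sorted(vectors)
    # Naive seeding (assumption): the first k documents become the centroids.
    centroids = [dict(vectors[d]) for d in doc_ids[:k]]
    assignment = {}
    for _ in range(iterations):
        # Each document is assigned independently of the others, so this is
        # the step that could be split across worker processes or machines.
        new_assignment = {d: nearest(vectors[d], centroids) for d in doc_ids}
        if new_assignment == assignment:
            break  # converged: no document changed cluster
        assignment = new_assignment
        for i in range(k):
            members = [vectors[d] for d in doc_ids if assignment[d] == i]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = mean(members)
    return assignment
```

Calling `kmeans(load_vectors(conn), k=2)` on an open connection returns a dictionary mapping each `doc_id` to a cluster index. The same shape carries over to C# against SQL Server: one set-based query to fetch the rows you need, then the iterative part in application code, where the per-document assignment step is the natural place to parallelize.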