Say I have a sample for which 5 million data objects are stored as rows in SQL Server. If I need to run some stats on the data, would it be better to have a table for each sample, or one giant table, where I would select by sample id and then run the stats?
There may eventually be hundreds or even thousands of samples, which seems like one massive table.
But I’m not a SQL Server expert so I can’t say whether one would be faster than the other…
Or is there maybe a better way to deal with such a large data set? I was hoping to use SQL CLR with C# to do my heavy lifting…
If you need to deal with such a large dataset, my gut feeling tells me T-SQL and working in sets will be significantly faster than anything you can do in SQL-CLR with an RBAR (row-by-agonizing-row) approach… dealing with large sets of data, summing up and selecting, that’s what T-SQL has always been made for and what it’s good at.
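To make the set-based idea concrete, here is a minimal sketch of what I mean. The table dbo.SampleData and its columns (SampleId, ObjectId, Value) are made up for illustration, since I don't know your actual structure; the point is only that a single statement computes the stats for one sample, with no per-row loop.

```sql
-- Hypothetical schema: one table holds all samples, keyed by SampleId.
-- Table and column names are assumptions, not taken from the question.
CREATE TABLE dbo.SampleData
(
    SampleId INT    NOT NULL,
    ObjectId BIGINT NOT NULL,
    Value    FLOAT  NOT NULL,
    CONSTRAINT PK_SampleData PRIMARY KEY CLUSTERED (SampleId, ObjectId)
);

-- Set-based statistics for a single sample: the engine scans only the
-- rows for that SampleId and aggregates them in one pass.
DECLARE @SampleId INT = 42;  -- example value

SELECT
    COUNT_BIG(*) AS RowCnt,
    AVG(Value)   AS MeanValue,
    STDEV(Value) AS StdDevValue,
    MIN(Value)   AS MinValue,
    MAX(Value)   AS MaxValue
FROM dbo.SampleData
WHERE SampleId = @SampleId;
```

With the clustered key leading on SampleId, all rows of one sample sit together, so selecting by sample id out of one big table should stay cheap even with many samples in it.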
5 million rows isn’t really an awful lot of data; it’s a nice-sized dataset. As long as you have the proper indices in place, e.g. on the columns you use in your JOIN conditions, in your WHERE clause and your ORDER BY clause, you should be just fine.
If you need more detailed advice, try to post your table structure and explain how you will query that table (what criteria you use for WHERE and ORDER BY), and we should be able to provide some more feedback.
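As a rough illustration of matching an index to the WHERE and ORDER BY columns, something along these lines might be a starting point. It assumes a hypothetical table dbo.SampleData with columns SampleId and Value; your real column names and query criteria will differ, so treat it as a sketch only.

```sql
-- Assumed table: dbo.SampleData (SampleId INT, Value FLOAT, ...).
-- The index leads with the WHERE column and follows with the ORDER BY
-- column, so rows can be read from the index already filtered and sorted.
CREATE NONCLUSTERED INDEX IX_SampleData_SampleId_Value
    ON dbo.SampleData (SampleId, Value);

DECLARE @SampleId INT = 42;  -- example value

-- Example query the index is meant to support: the ten largest values
-- for one sample.
SELECT TOP (10) Value
FROM dbo.SampleData
WHERE SampleId = @SampleId
ORDER BY Value DESC;
```

Checking the actual execution plan against your real data is the only way to know whether an index like this is being used.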