Say I have a few tables in an MSSQL database, each with about 5-10 attributes. There are some simple associations between the tables, but each of the tables has 500,000 to 1,000,000 rows.
There is an algorithm that runs on that data (all of it), so before running the algorithm, I have to retrieve all the data from the database. The algorithm does not change the data, only reads it, so I just need to retrieve the data.
I am using LINQ to SQL. Retrieving all the data takes about two minutes. What I want to know is whether serializing the objects to a file once, and then deserializing them whenever they are needed, would actually load the data faster.
The data is about 200 MB, and I don’t mind saving it to disk. So, would it be faster to deserialize the objects from the file than to load them through the LINQ to SQL DataContext?
Any experiences with this?
I would argue that LINQ to SQL may not be the best choice for this kind of application. With this many objects, you incur considerable overhead just creating the object instances (your persistent classes).
I would choose a solution where a stored procedure retrieves only the necessary data via ADO.NET, the application keeps it in memory (memory is cheap nowadays; 200 MB should not be a problem), and the analysis algorithm runs on the in-memory data.
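To illustrate the idea of fetching only the needed columns as plain rows instead of full persistent objects, here is a minimal, language-neutral sketch in Python. It is not the ADO.NET code itself: sqlite3 stands in for the MSSQL database, and the `measurements` table, its columns, and the helper names are hypothetical.

```python
import sqlite3


def build_demo_db(n):
    """Create an in-memory stand-in database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE measurements (id INTEGER PRIMARY KEY, value REAL, note TEXT)"
    )
    conn.executemany(
        "INSERT INTO measurements (id, value, note) VALUES (?, ?, ?)",
        ((i, i * 0.5, "unused") for i in range(n)),
    )
    conn.commit()
    return conn


def load_rows(conn):
    """Fetch only the columns the algorithm needs, as plain tuples."""
    cur = conn.execute("SELECT id, value FROM measurements ORDER BY id")
    return cur.fetchall()


conn = build_demo_db(1000)
rows = load_rows(conn)  # everything is now in memory

# The read-only algorithm then runs on the in-memory rows.
total = sum(value for _id, value in rows)
```

The `note` column is never selected, so it is neither transferred nor materialized. In .NET the same shape would be a data reader filling a list of lightweight structs.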
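I don’t think you should store the data in a file. In the end, your database is also simply one or more files that are read by the database engine. So you either let the database engine read the data from its own files, or you read the data from the database, write it to a file, and then read that file back.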
The latter option involves a lot of overhead without any advantages as far as I can see.
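If you would rather measure than take my word for it, a small timing harness settles the question for your data. The following is only a sketch under assumptions: Python with sqlite3 and pickle stands in for MSSQL and .NET serialization, and the `items` table and function names are hypothetical. The absolute numbers will not carry over, but the same comparison can be set up in .NET with a `Stopwatch`.

```python
import os
import pickle
import sqlite3
import tempfile
import time


def timed(fn):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Stand-in database with a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)", ((i, i * 1.5) for i in range(50000))
)
conn.commit()


def load_from_db():
    return conn.execute("SELECT id, value FROM items ORDER BY id").fetchall()


from_db, db_seconds = timed(load_from_db)

# Serialize once, then time only the deserialization.
path = os.path.join(tempfile.mkdtemp(), "items.pickle")
with open(path, "wb") as f:
    pickle.dump(from_db, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_from_file():
    with open(path, "rb") as f:
        return pickle.load(f)


from_file, file_seconds = timed(load_from_file)

print(f"database: {db_seconds:.4f}s, file: {file_seconds:.4f}s")
```

Run each variant several times and against a realistic data volume; a single run mostly measures cache state.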
EDIT: If your data changes very infrequently, you may consider preprocessing your data before analyzing and caching the preprocessed data somewhere (in the database or on the file system). This only makes sense if your preprocessed data can be analyzed (a lot) faster than the raw data. Maybe some preprocessing can be done in the database itself.