This is a design/algorithm question.
Here’s the outline of my scenario:
- I have a large table (say, 5 mil. rows) of data which I’ll call Cars
- Then I have an application, which performs a SELECT * on this Cars table, taking all the data and packaging it into a single data file (which is then uploaded somewhere.)
- This data file generated by my application represents a snapshot, what the table looked like at an instant in time.
- The table Cars, however, is updated sporadically by another process, regardless of whether the application is currently generating a package from the table or not. (There currently is no synchronization.)
My problem:
This table Cars is becoming too big to do a single SELECT * against. When my application retrieves all this data at once, it quickly overwhelms the memory capacity for my machine (let’s say, 2GB.) Also, simply performing chained SELECTs with LIMIT or OFFSET fails the condition of synchronization: the table is frequently updated and I can’t have the data change between SELECT calls.
What I’m looking for:
A way to pull the entirety of this table into an application whose memory capacity is smaller than the data, assuming the data size could approach infinity. Particularly, how do I achieve a pagination/segmented effect for my SQL selects? i.e. Make recurring calls with a page number to retrieve the next segment of data. The ideal solution allows for scalability in data size.
(For the sake of simplifying my scenario, we can assume that when given a segment of data, the application can process/write it then free up the memory used before requesting the next segment.)
Any suggestions you may be able to provide would be most helpful. Thanks!
EDIT: By request, my implementation uses C#.NET 4.0 & MSSQL 2008.
EDIT #2: This is not a SQL command question. This is a design-pattern question: what is the strategy to perform paginated SELECTs against a large table? (Especially when said table receives frequent updates.)
If you want a “snapshot” effect, you have to copy the data into a holding table where it will not get updated. You can accomplish some nice things with various types of change-tracking, but that’s not what you stated you wanted. If you need a snapshot of the exact table state, then take the snapshot, write it to a separate table, and use limit and offset (or whatever) against that table to create pages.
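The snapshot-then-page idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the asker's actual setup: it uses Python with an in-memory SQLite database as a runnable stand-in for C# and MSSQL, the holding-table name `CarsSnapshot`, the columns `id` and `model`, and the helpers `take_snapshot` and `iter_pages` are all made up for the example, and it pages by "last seen id" (keyset paging) rather than by offset.

```python
import sqlite3


def take_snapshot(conn, source="Cars", holding="CarsSnapshot"):
    """Copy the live table into a holding table that nothing else updates."""
    conn.execute(f"DROP TABLE IF EXISTS {holding}")
    # One statement does the whole copy, so the holding table
    # reflects a single state of the source table.
    conn.execute(f"CREATE TABLE {holding} AS SELECT * FROM {source}")
    conn.commit()


def iter_pages(conn, holding="CarsSnapshot", page_size=1000):
    """Yield the holding table in pages of at most page_size rows.

    Assumes a unique, positive integer `id` column (hypothetical schema).
    """
    last_id = 0
    while True:
        rows = conn.execute(
            f"SELECT id, model FROM {holding} "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not rows:
            return
        yield rows  # caller writes the page out, then it can be freed
        last_id = rows[-1][0]


# Demo: a tiny Cars table standing in for the 5-million-row one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Cars (id INTEGER PRIMARY KEY, model TEXT)")
conn.executemany(
    "INSERT INTO Cars (id, model) VALUES (?, ?)",
    [(i, f"model-{i}") for i in range(1, 11)],
)
conn.commit()

take_snapshot(conn)
pages = iter_pages(conn, page_size=4)
first = next(pages)

# Simulate the other process touching the live table mid-export.
conn.execute("UPDATE Cars SET model = 'changed' WHERE id = 10")
conn.execute("INSERT INTO Cars (id, model) VALUES (11, 'model-11')")
conn.commit()

rest = list(pages)
exported = first + [row for page in rest for row in page]
```

Only one page is held in memory at a time, and the later update and insert on Cars do not show up in the export because every page is read from the holding table. On MSSQL 2008 the same shape would presumably use `SELECT ... INTO` for the copy and `TOP` with an `id > @last` filter (or `ROW_NUMBER()`) for the pages, since `OFFSET ... FETCH` only arrived in SQL Server 2012.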
And at 5 million rows, I think it is the design requirement itself that likely needs to be modified. If you have 2000 clients all taking 5-million-row snapshots, you are going to start having some size issues if you don’t watch out.