I’m building an app to serve large amounts of data via a REST API and I’m looking for some input on how to architect it. I’m using .NET (C# 4.0), ASP.NET MVC and SQL Server 2008.
Right now I have about 400k rows in a relational database, with roughly 5% of them updated throughout the day by an internal app that goes directly to the database. I need to serve this data via a REST API that returns a custom XML format. However, the data needs to be processed before I can output it. The good thing is that I can pre-process it ahead of time if needed.
I wrote a small POC that gets the data, processes it, and caches it in a local XML file. Because of the processing, a full run over all 400k rows takes about an hour. Once the cache is built, I just return the physical file on every request.
Now I need to pick up changes as rows get updated in the source and refresh my cache, so I don’t have to regenerate everything every time a single row changes.
I’m thinking about using AppFabric to keep a memory cache, and using physical files just to make sure that if the memory cache goes away I don’t need to start from scratch. As soon as a row gets updated in the source, I would update the memory cache and rewrite the physical file to make sure it’s up to date.
So my primary source would be the AppFabric cache, then the physical cache file, and as a last resort I would regenerate the file from the database, which would take about an hour and make the file unavailable to anyone calling the API in the meantime.
I’m not very happy with this, but it’s what I’ve got. Any suggestions?
Thanks a lot!
Thanks for your clarification above. Here’s an option based on that.
Add a table to your DB. Call it Products_Processed (or Prices, whatever). This new table has one row for each row in Products (i.e., it is 1-to-1 with the source data). Each row in this new table contains the processed data for the corresponding source row.
Each time a row is updated in Products by the external app, you compute just that row and update the corresponding row in Products_Processed.
Here are a few ways to get code to run on just the newly updated entries:
This approach keeps your derived data logically close to the source of truth (in the DB with the source data) and reduces the number of times you copy/format/handle the data. Also, importantly, using the tried-and-true DB-provided mechanisms for detecting/triggering on changed data will save you from writing a lot of your own syncing code.
Now, returning your results is essentially streaming out a select * from Products_Processed. If you want to return the processed data for just specific products, you have the full power of SQL and your schema; likewise for sorting. This whole setup should be fast enough that you don’t need to cache the file on disk. In fact, MSSQL caching should probably keep most or all of the processed data rows in RAM if you have enough, so you’ll rarely have to do a cold select (and if you don’t have enough RAM, consider what a few extra gigs are worth compared to your time; throwing hardware at a problem is never cheating ;).
(However, if you really want to write it out to disk, you can store each row’s offset into the physical file, and quickly update individual records in the file as the corresponding processed data row is updated.)
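As a rough illustration of the streaming part, the sketch below reads the processed rows in batches and yields the response piece by piece, so the full document is never held in memory. It is written in Python against SQLite purely so it can be run as-is; the xml and product_id column names, the products wrapper element, and the batch size are assumptions, not anything from your schema.

```python
import sqlite3


def stream_products_xml(conn, batch_size=1000):
    """Yield the XML response in pieces instead of building one big string."""
    yield "<products>"
    cursor = conn.execute(
        "SELECT xml FROM Products_Processed ORDER BY product_id"
    )
    while True:
        # fetchmany keeps at most batch_size rows in memory at a time.
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        for (fragment,) in batch:
            yield fragment
    yield "</products>"
```

In a web handler you would write each yielded piece straight to the response stream. Filtering or sorting then becomes a matter of changing the SELECT, not the caching scheme.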