I have a database table with N records, each of which needs to be refreshed every 4 hours. The “refresh” operation is pretty resource-intensive. I’d like to write a scheduled task that runs occasionally and refreshes them, while smoothing out the spikes of load.
The simplest task I started with is this (pseudocode):
every 10 minutes:
    find all records that haven't been refreshed in 4 hours
    for each record:
        refresh it
        set its last refresh time to now
(Technical detail: “refresh it” above is asynchronous; it just queues a task for a worker thread pool to pick up and execute.)
What this causes is a huge resource (CPU/IO) usage spike every 4 hours, with the machine idling the rest of the time. Since the machine also does other stuff, this is bad.
I’m trying to figure out a way to get these refreshes to be more or less evenly spaced out. That is, I’d want around N × (10 minutes / 4 hours), i.e. N/24, of those records to be refreshed on every run. Of course, it doesn’t need to be exact.
Notes:
- I’m fine with the algorithm taking time to start working (so say, for the first 24 hours there will be spikes but those will smooth out over time), as I only rarely expect to take the scheduler offline.
- Records are constantly being added and removed by other threads, so we can’t assume anything about the value of N between iterations.
- I’m fine with records being refreshed every 4 hours +/- 20 minutes.
Do a full refresh, to get all your timestamps in sync. From that point on, every 10 minutes, refresh the oldest N/24 records.
The load will be steady from the start, and after 24 runs (4 hours), all your records will be updating at 4-hour intervals (if N is fixed). Insertions will decrease refresh intervals; deletions may cause increases or decreases, depending on the deleted record’s timestamp. But I suspect you’d need to be deleting quite a lot (like, 10% of your table at a time) before you start pushing anything outside your 40-minute window. To be on the safe side, you could do a few more than N/24 each run.
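Here is a minimal Python sketch of the "refresh the oldest N/24" step, in case it helps. It assumes the records can be represented as an in-memory dict mapping a record id to its last refresh time; the names `pick_batch`, `run_once`, `slack` and the `refresh` callback are made up for illustration and stand in for your own table query and task queue.

```python
import math
from datetime import datetime, timedelta

REFRESH_PERIOD = timedelta(hours=4)
RUN_INTERVAL = timedelta(minutes=10)
# timedelta // timedelta yields an int: 24 runs per refresh period.
RUNS_PER_PERIOD = REFRESH_PERIOD // RUN_INTERVAL


def pick_batch(last_refreshed, runs_per_period=RUNS_PER_PERIOD, slack=0):
    """Return the ids of the oldest ceil(N / runs_per_period) + slack records.

    N is recomputed on every call, so insertions and deletions between
    runs simply change the size of the next batch.
    """
    n = len(last_refreshed)
    if n == 0:
        return []
    batch_size = min(n, math.ceil(n / runs_per_period) + slack)
    # Sort ids by timestamp, oldest first (hypothetical in-memory stand-in
    # for an "order by last refresh time, limit batch_size" query).
    oldest_first = sorted(last_refreshed, key=last_refreshed.get)
    return oldest_first[:batch_size]


def run_once(last_refreshed, now, refresh):
    """One scheduler run: queue a refresh for the oldest batch and stamp it."""
    for record_id in pick_batch(last_refreshed):
        refresh(record_id)  # assumed to just enqueue work, as in the question
        last_refreshed[record_id] = now


if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    records = {i: start for i in range(240)}  # state right after a full refresh
    queued = []
    for run in range(1, RUNS_PER_PERIOD + 1):
        run_once(records, start + run * RUN_INTERVAL, queued.append)
    print(len(queued), len(set(queued)))
```

Passing a small positive `slack` to `pick_batch` corresponds to doing "a few more than N/24" each run. Against a real table, the sort would presumably become an ordered, limited query on the last-refresh column rather than an in-memory sort.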