I’m using App Engine servers. I expect to get many requests (dozens) in close proximity that will put some of my data in an inconsistent state. The cleanup of that data can be efficiently batched; for example, it would be best to run my cleanup code just once, after the dozens of requests have all completed. I don’t know exactly how many requests there will be, or how close together they will be. It is OK if the cleanup code is run multiple times, but it must be run after the last request.
What’s the best way to minimize the number of cleanup runs?
Here’s my idea:
public void handleRequest() {
    manipulateData();
    if (memCacheHasCleanupToken()) {
        return; // yay, a cleanup is already scheduled
    } else {
        scheduleDeferredCleanup(5); // run 5 seconds from now
        addCleanupTokenToMemCache();
    }
}
...
public void deferredCleanupMethod() {
    removeCleanupTokenFromMemcache();
    cleanupData();
}
I think this will break down because cleanupData might receive outdated data even after some request has found that there IS a cleanup token in the memcache (HRD latency, etc), so some data might be missed in the cleanup.
So, my questions:
- Will this general strategy work? Maybe if I use a transactional lock on a datastore entity?
- What strategy should I use?
The general strategy you suggest will work, provided the data that needs cleaning up isn’t stored on each instance (e.g., it’s in the datastore or memcache), and provided your scheduleDeferredCleanup method uses the task queue. An optimization would be to use task names based on the time interval in which the tasks run, to avoid scheduling duplicate cleanups if the memcache key expires.
One issue to watch out for with the procedure you describe above, though, is race conditions. As stated, a request being processed at the same time as the cleanup task may check memcache, observe that the token is there, and neglect to enqueue a cleanup task, while the cleanup task has already finished but has not yet removed the memcache key. The easiest way to avoid this is to make the memcache key expire on its own, before the related task executes. That way, you may schedule duplicate cleanup tasks, but you should never omit one that’s required.
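To make the two suggestions concrete, here is a minimal in-memory sketch of one way they might fit together. It is a simulation, not App Engine code: the class CleanupCoalescer, its method names, and the INTERVAL_MS and MARGIN_MS values are all hypothetical, the long field stands in for a memcache key with an expiry, and the set stands in for a task queue that rejects a second task with the same name. It assumes each task is named after the interval in which the request arrived and is scheduled to run shortly after that interval ends, so the token always expires before the task executes.

```java
import java.util.HashSet;
import java.util.Set;

// In-memory simulation only: no App Engine APIs are called here.
final class CleanupCoalescer {
    static final long INTERVAL_MS = 5_000; // width of one scheduling interval (assumed value)
    static final long MARGIN_MS = 1_000;   // gap between token expiry and task run (assumed value)

    // Stand-in for a memcache key that expires on its own.
    private long tokenExpiresAtMs = Long.MIN_VALUE;
    // Stand-in for a task queue that refuses a second task with the same name.
    private final Set<String> taskNames = new HashSet<>();

    // End of the interval containing nowMs (nowMs is assumed non-negative).
    static long intervalEndMs(long nowMs) {
        return (nowMs / INTERVAL_MS + 1) * INTERVAL_MS;
    }

    // All requests in the same interval map to the same task name.
    static String taskNameFor(long nowMs) {
        return "cleanup-" + (nowMs / INTERVAL_MS);
    }

    // The task runs after the interval ends, hence after the token has expired.
    static long taskEtaMs(long nowMs) {
        return intervalEndMs(nowMs) + MARGIN_MS;
    }

    // Call after manipulateData(); returns true if a new cleanup task was enqueued.
    boolean onRequest(long nowMs) {
        if (nowMs < tokenExpiresAtMs) {
            return false; // token still live: a task is due to run after this request
        }
        tokenExpiresAtMs = intervalEndMs(nowMs); // token expires before the task's ETA
        return taskNames.add(taskNameFor(nowMs)); // false if that name is already queued
    }

    // Simulates memcache dropping the key early.
    void evictToken() {
        tokenExpiresAtMs = Long.MIN_VALUE;
    }
}
```

Note that the cleanup task never removes the token in this sketch; the token simply lapses at the end of its interval, which is what closes the race described above. If memcache evicts the token early, the next request tries to enqueue again, and the shared task name is what keeps that from producing a duplicate.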