I’ve been working on creating a subclass of db.Model that is automatically cached, i.e.:
- instance.put would store the entity in memcache before persisting it to the datastore
- class.get_by_key_name would first check the cache, and if missed, would go to the datastore to retrieve it and cache it after retrieval
I developed the approach below (which appears to work for me), but I have a few questions:
- I had read Nick Johnson’s article on efficient model memcaching which suggests implementing the serialization for memcache through protocol buffers. Looking at the memcache API source code in the SDK, it looks like Google has already implemented protobuf serialization by default. Is my interpretation correct?
- Am I missing some important details (which could get me in the future) in the way I am subclassing db.Model or overriding the two methods?
- Is there a more efficient way of implementing what I’ve done below?
- Are there guidelines, benchmarks or best practices for when such entity caching would make sense from a performance perspective? Or would it always make sense to cache entities? On a related note, should I be reading anything into the fact that Google hasn’t provided a cached model in the modeling API? Are there too many special cases to be thinking about?
Below is my current implementation. I would really appreciate any and all guidance/suggestions on caching entities (even if your response is not a direct answer to one of the 4 questions above, but relevant to the topic overall).
from google.appengine.ext import db
from google.appengine.api import memcache
import os
import logging
class CachedModel(db.Model):
'''Subclass of db.Model that automatically caches entities for put and
attempts to load from cache for get_by_key_name
'''
@classmethod
def get_by_key_name(cls, key_names, parent=None, **kwargs):
cache = memcache.Client()
# Ensure that every new deployment of the application results in a cache miss
# by including the application version ID in the namespace of the cache entry
namespace = os.environ['CURRENT_VERSION_ID'] + '_' + cls.__name__
if not isinstance(key_names, list):
key_names = [key_names]
entities = cache.get_multi(key_names, namespace=namespace)
if entities:
logging.info('%s (namespace=%s) retrieved from memcache' % (str(entities.keys()), namespace))
missing_key_names = list(set(key_names) - set(entities.keys()))
# For keys missed in memcahce, attempt to retrieve entities from datastore
if missing_key_names:
missing_entities = super(CachedModel, cls).get_by_key_name(missing_key_names, parent, **kwargs)
missing_mapping = zip(missing_key_names, missing_entities)
# Determine entities that exist in datastore and store them to memcache
entities_to_cache = dict()
for key_name, entity in missing_mapping:
if entity:
entities_to_cache[key_name] = entity
if entities_to_cache:
logging.info('%s (namespace=%s) cached by get_by_key_name' % (str(entities_to_cache.keys()), namespace))
cache.set_multi(entities_to_cache, namespace=namespace)
non_existent = set(missing_key_names) - set(entities_to_cache.keys())
if non_existent:
logging.info('%s (namespace=%s) missing from cache and datastore' % (str(non_existent), namespace))
# Combine entities retrieved from cache and entities retrieved from datastore
entities.update(missing_mapping)
if len(key_names) == 1:
return entities[key_names[0]]
else:
return [entities[key_name] for key_name in key_names]
def put(self, **kwargs):
cache = memcache.Client()
namespace = os.environ['CURRENT_VERSION_ID'] + '_' + self.__class__.__name__
cache.set(self.key().name(), self, namespace=namespace)
logging.info('%s (namespace=%s) cached by put' % (self.key().name(), namespace))
return super(CachedModel, self).put(**kwargs)
Rather than reinventing the wheel, why not switch to NDB, which already implements memcaching of model instances?