I’m relatively new to StackOverflow and not sure if it’s an appropriate place to ask a design question. The site gives me the hint “The question you’re asking appears subjective and is likely to be closed”. Perhaps it should be asked on programmers.stackexchange.com. Please let me know.
Anyway, one of the projects I’m working on is an online survey engine. It’s my first big commercial project on GAE.
I need your advice on how to collect stats and efficiently record them in Datastore without bankrupting me. Initial requirements are:
- After a user finishes a survey, the client sends a list of pairs [ID (int) + PercentHit (double)]. This list shows how closely the answers of this user match the predefined answers of reference answerers (which are identified by IDs). I call them “target IDs”.
- The creator of the survey wants to see the aggregated % for given IDs for the last hour, for a particular timeframe, or from the beginning of the survey.
- Some surveys may have thousands of target/reference answerers.
So I created this entity:
    public class HitsStatsDO implements Serializable
    {
        // 'transient' fields are skipped by Java serialization
        @Id
        transient private Long id;
        transient private Long version = 0L;
        transient private Long startDate;
        @Parent transient private Key parent; // fake parent which contains target id
        @Transient int targetId; // not persisted (the target ID lives in the parent key)
        private double avgPercent;
        private long hitCount;
    }
But writing a HitsStatsDO for each target from each user would produce a lot of data. For instance, I had a survey with 3,000 targets which was answered by ~4 million people within one week, with 300K people taking the survey on the first day. Even if we assume they answered evenly over 24 hours, that is 300,000 × 3,000 / 86,400 ≈ 10,400 writes/second. Obviously that hits the concurrent write limits of Datastore.
I decided to collect data for one hour and save only the aggregate; that’s why there are avgPercent and hitCount in HitsStatsDO. GAE instances are stateless, so I had to use a dynamic backend instance.
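The raw write rate for that first day can be checked with a few lines. This is only a back-of-the-envelope sketch: the class and method names are made up for illustration, and spreading the first day's users evenly over 24 hours is the same simplifying assumption as in the text.

```java
// Hypothetical helper, not part of the survey engine.
class WriteRateEstimate {
    // One entity write per target for every user who finishes the survey.
    static double writesPerSecond(long users, long targetsPerUser, long periodSeconds) {
        return (double) users * targetsPerUser / periodSeconds;
    }

    public static void main(String[] args) {
        long firstDayUsers = 300_000L;      // users on the first day
        long targets = 3_000L;              // targets in the survey
        long secondsPerDay = 24L * 60L * 60L;
        System.out.printf("%.0f writes/second%n",
                writesPerSecond(firstDayUsers, targets, secondsPerDay));
    }
}
```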
There I have something like this:
    // Contains stats for one hour
    private class Shard
    {
        // Guards 'map' against concurrent requests
        ReadWriteLock lock = new ReentrantReadWriteLock();
        Map<Integer, HitsStatsDO> map = new HashMap<Integer, HitsStatsDO>(); // Key is target ID

        public void saveToDatastore() { /* body omitted */ }
        public void updateStats(Long startDate, Map<Integer, Double> hits) { /* body omitted */ }
    }
and a map holding the shard for the current hour and the shard for the previous hour (which doesn’t stay there for long):
    private HashMap<Long, Shard> shards = new HashMap<Long, Shard>(); // Key is HitsStatsDO.startDate
So once per hour I dump the Shard for the previous hour to Datastore.
Plus I have a class LifetimeStats, which keeps a Map<Integer, HitsStatsDO> in memcache, where the map key is the target ID.
Also, in my backend shutdown hook I dump the stats for the unfinished hour to Datastore.
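The hourly in-memory aggregation described above might look roughly like the sketch below. It is not the actual Shard code: TargetStats is a hypothetical stand-in for the two aggregated fields of HitsStatsDO, HourlyAggregator and its method names are invented for illustration, a plain synchronized block replaces the ReadWriteLock, and persisting the detached shard is left to the caller.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for HitsStatsDO: only the two aggregated fields.
class TargetStats {
    double avgPercent;
    long hitCount;

    // Fold one new sample into the running average without storing the samples.
    void add(double percent) {
        hitCount++;
        avgPercent += (percent - avgPercent) / hitCount;
    }
}

// One bucket per hour, keyed by the start of the hour in epoch milliseconds.
class HourlyAggregator {
    static final long HOUR_MS = 60L * 60L * 1000L;

    private final Map<Long, Map<Integer, TargetStats>> shards = new HashMap<>();

    // Round a (non-negative) timestamp down to the start of its hour.
    static long hourStart(long timestampMs) {
        return timestampMs - (timestampMs % HOUR_MS);
    }

    // Record one user's result: target ID -> percent hit.
    synchronized void updateStats(long timestampMs, Map<Integer, Double> hits) {
        Map<Integer, TargetStats> shard =
                shards.computeIfAbsent(hourStart(timestampMs), k -> new HashMap<>());
        for (Map.Entry<Integer, Double> e : hits.entrySet()) {
            shard.computeIfAbsent(e.getKey(), k -> new TargetStats()).add(e.getValue());
        }
    }

    // Detach a finished hour so the caller can persist it; null if nothing was recorded.
    synchronized Map<Integer, TargetStats> removeShard(long hourStartMs) {
        return shards.remove(hourStartMs);
    }
}
```

Keeping only the running average and the count means memory use grows with the number of targets, not with the number of users.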
There is only one major issue here: I have only ONE backend instance 🙂 It raises the following questions, on which I’d like to hear your opinion:
- Can I do this without using a backend instance?
- What if one instance is not enough?
- How can I split data between multiple dynamic backend instances? It’s hard because I don’t know how many I have, since Google creates new ones as load increases.
- I know I can launch an exact number of resident backend instances. But how many? 2, 5, 10? What if I have no load at all for a week? Constantly running 10 backend instances is too expensive.
- What do I do with data from clients while the backend instance is dead/restarting?
One thing to note is that I can’t change the client much. Currently it’s JavaScript embedded into the web pages of customers. I can change the RPC in some way, but architecturally I cannot replace the client with Google Docs forms, for example.
Thank you very much in advance for your thoughts.
My service went live and I want to share how I implemented it.
So instead of gathering data in the memory of a single backend instance for an hour, I decided to gather it in multiple dynamic backend instances and update the shard for the current hour in Datastore every 10 minutes from each instance. Class Shard stays the same with the exception of saveToDatastore(), where I now update the HitsStatsDOs in a transaction loop to make sure each one is updated even if another backend instance changes the shard at the same moment.
In order to fetch HitsStatsDO really fast, I decided to put the target ID in the fake parent key and the timestamp of the shard in the primary ID.
This entity takes only 2 writes to be stored. The number of writes for each entity is never more than ([number of backend instances] * 2 * 6) per hour (6 updates per hour, one every 10 minutes), which is not bad. Also, I can pre-create keys in my code and do a batch get from Datastore.
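The transaction loop mentioned above can be illustrated without any App Engine dependency. In this sketch an AtomicReference stands in for one stored entity and compareAndSet plays the role of the transaction commit; StatsSnapshot, OptimisticStore and mergeWithRetry are hypothetical names, and a real implementation would retry on the Datastore's concurrency exception instead.

```java
import java.util.concurrent.atomic.AtomicReference;

// Immutable (average, count) pair; a hypothetical stand-in for the stored fields of HitsStatsDO.
final class StatsSnapshot {
    final double avgPercent;
    final long hitCount;

    StatsSnapshot(double avgPercent, long hitCount) {
        this.avgPercent = avgPercent;
        this.hitCount = hitCount;
    }

    // Combine two partial aggregates into one using a count-weighted mean.
    StatsSnapshot merge(StatsSnapshot other) {
        long total = hitCount + other.hitCount;
        if (total == 0) {
            return this;
        }
        double avg = (avgPercent * hitCount + other.avgPercent * other.hitCount) / total;
        return new StatsSnapshot(avg, total);
    }
}

class OptimisticStore {
    // Stand-in for one stored entity.
    private final AtomicReference<StatsSnapshot> stored =
            new AtomicReference<>(new StatsSnapshot(0.0, 0L));

    // Read, merge, and try to commit; start over if another writer got in first.
    StatsSnapshot mergeWithRetry(StatsSnapshot local, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            StatsSnapshot current = stored.get();
            StatsSnapshot merged = current.merge(local);
            if (stored.compareAndSet(current, merged)) {
                return merged;
            }
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts");
    }
}
```

Because each instance merges its local (average, count) pair into whatever is currently stored, the order in which instances flush does not change the final result.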
Similarly, I changed HitsStatsTotalDO, which contains stats from the beginning of the survey. Same thing: 2 writes to store/update.
The service went live 3 days ago. Maximum load so far was 230 QPS. I’m using dynamic B1 type instances. In the config I set a maximum of 4 instances for now, but to my pleasure GAE never instantiated more than one. And surprisingly I haven’t had concurrency exceptions yet.
Let me know if you have any questions or think I missed something.
And thank you everyone for your help. StackOverflow is a really awesome community.