I’m looking for suggestions for scaling a points leaderboard system. I already have a working version that uses a fully normalized design. This first version is essentially a single table that looks something like this:
UserPoints - PK: (UserId, Date)

    +------------+--------+---------------------+
    | UserId     | Points | Date                |
    +------------+--------+---------------------+
    | 1          | 10     | 2011-03-17 07:16:36 |
    | 2          | 35     | 2011-03-17 08:09:26 |
    | 3          | 40     | 2011-03-17 08:05:36 |
    | 1          | 65     | 2011-03-17 09:01:37 |
    | 2          | 16     | 2011-03-17 10:12:35 |
    | 3          | 64     | 2011-03-17 12:51:33 |
    | 1          | 300    | 2011-03-17 12:19:21 |
    | 2          | 1200   | 2011-03-17 13:24:13 |
    | 3          | 510    | 2011-03-17 17:29:32 |
    +------------+--------+---------------------+
I then have a stored procedure that groups by UserId and sums the Points. I can also pass @StartDate and @EndDate parameters to build a leaderboard for a specific time period, for example Top Users for the Day / Week / Month / Lifetime.
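To make the current approach concrete, here is a rough sketch of that query, run against an in-memory SQLite database from Python rather than SQL Azure. The function name `leaderboard`, the `limit` parameter, and the half-open range (`Date >= start` and `Date < end`) are my assumptions for illustration; the real stored procedure may treat the end date differently.

```python
import sqlite3

def leaderboard(conn, start_date, end_date, limit=10):
    """Sum points per user for rows with start_date <= Date < end_date."""
    return conn.execute(
        """
        SELECT UserId, SUM(Points) AS TotalPoints
        FROM UserPoints
        WHERE Date >= ? AND Date < ?
        GROUP BY UserId
        ORDER BY TotalPoints DESC, UserId ASC
        LIMIT ?
        """,
        (start_date, end_date, limit),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE UserPoints ("
    "UserId INTEGER, Points INTEGER, Date TEXT, PRIMARY KEY (UserId, Date))"
)
# The sample rows from the table above, stored as ISO-formatted UTC strings.
conn.executemany(
    "INSERT INTO UserPoints VALUES (?, ?, ?)",
    [
        (1, 10, "2011-03-17 07:16:36"),
        (2, 35, "2011-03-17 08:09:26"),
        (3, 40, "2011-03-17 08:05:36"),
        (1, 65, "2011-03-17 09:01:37"),
        (2, 16, "2011-03-17 10:12:35"),
        (3, 64, "2011-03-17 12:51:33"),
        (1, 300, "2011-03-17 12:19:21"),
        (2, 1200, "2011-03-17 13:24:13"),
        (3, 510, "2011-03-17 17:29:32"),
    ],
)

# Whole-day leaderboard, then a narrower morning window.
day_board = leaderboard(conn, "2011-03-17 00:00:00", "2011-03-18 00:00:00")
morning_board = leaderboard(conn, "2011-03-17 07:00:00", "2011-03-17 10:00:00")
```

Every call re-scans and re-sums all raw rows inside the window, which is where the cost grows as the number of point records increases.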
This seemed to work well with a moderate amount of data, but things became noticeably slower as the number of points records passed a million or so. The test data I’m working with is just over a million point records created by about 500 users distributed over a timespan of 3 months.
Is there a different way to approach this? I have experimented with denormalizing the data by pre-grouping the points into hour datetime buckets to reduce the number of rows. But I’m starting to think the real problem I need to worry about is the increasing number of users that need to be accounted for in the leaderboard. The time window sizes will generally be small but more and more users will start generating points within any given window.
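The hourly pre-grouping I experimented with looks roughly like the sketch below. The function names and the sample rows are made up for illustration (the second row is invented so that two rows land in the same hour), and the query side assumes window boundaries fall on whole hours.

```python
from collections import defaultdict
from datetime import datetime

def bucket_by_hour(rows):
    """Collapse (user_id, points, utc_datetime) rows into one total per user per UTC hour."""
    buckets = defaultdict(int)
    for user_id, points, when in rows:
        hour_start = when.replace(minute=0, second=0, microsecond=0)
        buckets[(user_id, hour_start)] += points
    return dict(buckets)

def leaderboard_from_buckets(buckets, start, end):
    """Rank users by summing buckets whose hour starts in [start, end).

    start and end are assumed to be on whole-hour boundaries.
    """
    totals = defaultdict(int)
    for (user_id, hour_start), points in buckets.items():
        if start <= hour_start < end:
            totals[user_id] += points
    # Highest total first; ties broken by user id.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))

rows = [
    (1, 10, datetime(2011, 3, 17, 7, 16, 36)),
    (1, 5, datetime(2011, 3, 17, 7, 40, 0)),   # invented row, same hour as the one above
    (2, 35, datetime(2011, 3, 17, 8, 9, 26)),
    (1, 65, datetime(2011, 3, 17, 9, 1, 37)),
]
buckets = bucket_by_hour(rows)
board = leaderboard_from_buckets(
    buckets, datetime(2011, 3, 17, 7), datetime(2011, 3, 17, 9)
)
```

This caps the row count at one row per user per hour, so it helps with point volume, but the number of rows still grows linearly with the number of active users in a window.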
Unfortunately I don’t have access to ‘Jobs’ since I’m using SQL Azure and the Agent is not available (yet). But, I am open to the idea of scaling this using a different storage system if you are convincing enough.
My past work experience tells me I should look into data warehousing since this is almost a reporting problem. But at the same time I need it to be as real-time as possible.
Update
Ultimately, I would like to support custom leaderboards that could span from Monday 8am – Friday 6pm every week. But that’s down the road and why I’m trying to not get too fancy with the aggregation. I’m willing to settle with basic Day/Week/Month/Year/AllTime windows for now.
The tricky part is that I really can’t store them denormalized because I need these windows to be TimeZone convertible. The system is multi-tenant and therefore all data is stored as UTC. The problem is that a week starts at a different UTC hour for different customers, so pre-aggregating the sums into UTC-aligned buckets would cause some points to fall into the wrong buckets.
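Here is a small sketch of the bucket problem. It assumes weeks start on Monday at local midnight and uses fixed UTC offsets (-8 and +1 hours) as stand-ins for two customers' time zones; real zones also have daylight saving changes, which this ignores. The function name `local_week_start_utc` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

def local_week_start_utc(utc_moment, utc_offset_hours):
    """Return the UTC instant at which the tenant's local week containing utc_moment began.

    Assumes the local week starts on Monday at 00:00 and a fixed UTC offset.
    """
    tenant_tz = timezone(timedelta(hours=utc_offset_hours))
    local = utc_moment.astimezone(tenant_tz)
    local_midnight = local.replace(hour=0, minute=0, second=0, microsecond=0)
    local_monday = local_midnight - timedelta(days=local.weekday())
    return local_monday.astimezone(timezone.utc)

# One point record, stored as UTC: Monday 2011-03-21 02:00 UTC.
point_time = datetime(2011, 3, 21, 2, 0, tzinfo=timezone.utc)

# For the UTC-8 customer it is still Sunday evening, so the point belongs
# to the week that began on 2011-03-14 08:00 UTC.
west_week = local_week_start_utc(point_time, -8)

# For the UTC+1 customer it is already Monday morning, so the point belongs
# to the week that began on 2011-03-20 23:00 UTC.
east_week = local_week_start_utc(point_time, 1)
```

The same UTC timestamp lands in different weeks depending on the customer, so a single set of UTC-aligned weekly sums cannot be correct for both.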
I decided to go with the idea of storing points along with a timespan (StartDate and EndDate columns) localized to the customer’s current TimeZone setting. An extra benefit I realized is that I can ‘purge’ old leaderboard round data after a few months without affecting the lifetime total of points.
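A minimal in-memory sketch of that idea follows. The class `RoundStore` and its method names are hypothetical, and it assumes the lifetime total is kept in its own running counter, separate from the per-round rows, which is what lets a purge leave it untouched. The round boundaries passed in are assumed to be already computed from the customer's TimeZone setting.

```python
from collections import defaultdict
from datetime import datetime

class RoundStore:
    """Points per (user, round) plus a separate lifetime counter per user."""

    def __init__(self):
        # (user_id, round_start, round_end) -> points earned in that round
        self.round_points = defaultdict(int)
        # user_id -> all points ever earned (assumed to be stored separately)
        self.lifetime_points = defaultdict(int)

    def add_points(self, user_id, points, round_start, round_end):
        """Credit points to the given round and to the user's lifetime total."""
        self.round_points[(user_id, round_start, round_end)] += points
        self.lifetime_points[user_id] += points

    def purge_rounds_ended_by(self, cutoff):
        """Delete rounds whose end is at or before cutoff; return how many were removed."""
        stale = [key for key in self.round_points if key[2] <= cutoff]
        for key in stale:
            del self.round_points[key]
        return len(stale)

store = RoundStore()
january_round = (datetime(2011, 1, 3), datetime(2011, 1, 10))
march_round = (datetime(2011, 3, 14), datetime(2011, 3, 21))
store.add_points(1, 100, *january_round)
store.add_points(1, 50, *march_round)

# Purge everything that ended on or before 1 February 2011.
removed = store.purge_rounds_ended_by(datetime(2011, 2, 1))
```

After the purge only the March round remains, while the lifetime counter for user 1 still reflects both rounds.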