We need to track user activity over different time periods, such as 24 hours, 7 days, etc. We don't anticipate a very large number of different periods, but the number of users will be very large, probably in the millions. A nightly cron job to summarize the stats for each user doesn't sound reasonable. In the past I've tracked network usage like this with RRD tables, but those were just BerkeleyDBs and had to be one file per statistic, which wouldn't work here. Still, that idea seems like what I'm after. Is there a pattern or best practice that I'm overlooking?
It depends on which architecture you want to use and which hardware you can afford.
For massive data analysis I would go for a cluster-based framework like Hadoop and build map/reduce functions to process your data.
See http://hadoop.apache.org/.
User activity can be stored in daily files, which are uploaded to the Hadoop cluster and then processed.
A solution like this gives you the necessary scalability while requiring only commodity hardware.
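To make the map/reduce idea a bit more concrete, here is a rough sketch in plain Python rather than actual Hadoop code. Everything in it is an assumption for illustration: the log format (one line per event, `user_id<TAB>ISO timestamp`) and the function names `map_line`, `reduce_counts` and `activity_in_window` are all made up. The map step emits one count per user per day, the reduce step sums them, and a period such as 7 days is then just a sum over the daily totals.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta


def map_line(line):
    """Map step: turn one log line into a ((user_id, day), 1) pair.

    Assumed (hypothetical) line format: "user_id<TAB>ISO-8601 timestamp".
    """
    user_id, timestamp = line.rstrip("\n").split("\t")
    day = datetime.fromisoformat(timestamp).date()
    return (user_id, day), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts for each (user_id, day) key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def activity_in_window(daily_totals, user_id, end_day, days):
    """Sum a user's daily totals over the `days` days ending on end_day (inclusive)."""
    start_day = end_day - timedelta(days=days - 1)
    return sum(
        count
        for (uid, day), count in daily_totals.items()
        if uid == user_id and start_day <= day <= end_day
    )


if __name__ == "__main__":
    # Stand-in for the contents of a few daily files.
    log_lines = [
        "alice\t2024-01-01T09:00:00",
        "alice\t2024-01-01T17:30:00",
        "bob\t2024-01-03T12:00:00",
        "alice\t2024-01-07T08:15:00",
    ]
    daily = reduce_counts(map_line(line) for line in log_lines)
    print(activity_in_window(daily, "alice", date(2024, 1, 7), 1))
    print(activity_in_window(daily, "alice", date(2024, 1, 7), 7))
```

Note that with daily files the "24 hours" period here is really one calendar day. If you need a true rolling 24 hours you would probably have to bucket by hour instead. On a real cluster the map and reduce functions would run distributed over the uploaded files, and the per-user daily totals would be stored somewhere so that each period is a cheap sum rather than a rescan of the raw logs.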