I started developing a website analytics system in MySQL for a project I’m working on, but have quickly realised it’s not going to be sufficient for my needs (in terms of scalability, speed etc). After doing a fair bit of research, MongoDB keeps cropping up as a good candidate. The only problem is that I have no experience with it and don’t know the best practices for high-performance, large MongoDB databases as well as I do for MySQL.
When a user visits a website it needs to record the standard info (IP, browser info, website ID, URL, username). It also needs to record every subsequent page the user visits (current timestamp, url). If a user leaves the website and comes back 10 days later, it needs to log that visit and also record that it’s a returning user (identified by their username).
In addition to logging visits for multiple websites (looking at 500 records being added per second) it needs to have reporting capability. I’m fine with producing graphs etc but I need to know how to extract the data from the database efficiently. I’d like to be able to provide graphs that show activity for every 15 minutes, but an hour would be sufficient if it’s more practical.
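To make the 15-minute granularity concrete, here is a minimal Python sketch of rounding timestamps down to the start of their bucket and counting page views per bucket. The names `BUCKET_SECONDS`, `bucket_start` and `count_per_bucket` are made up for illustration, and the timestamps are assumed to be timezone-aware.

```python
from collections import Counter
from datetime import datetime, timezone

# Assumed granularity: 15 minutes. Use 3600 for hourly graphs.
BUCKET_SECONDS = 15 * 60


def bucket_start(ts: datetime, bucket_seconds: int = BUCKET_SECONDS) -> datetime:
    """Round a timezone-aware timestamp down to the start of its bucket (UTC)."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % bucket_seconds, tz=timezone.utc)


def count_per_bucket(timestamps, bucket_seconds: int = BUCKET_SECONDS) -> Counter:
    """Count page views per bucket; keys are the bucket start times."""
    return Counter(bucket_start(ts, bucket_seconds) for ts in timestamps)
```

Switching between 15-minute and hourly graphs is then only a matter of changing the bucket size, so the choice does not have to be baked into the stored data.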
As a side thought it’d be nice if it could be capable of real-time reporting in the future, but that’s outside the scope of the current project.
Now I’ve read the article at http://blog.mongodb.org/post/171353301/using-mongodb-for-real-time-analytics but it doesn’t mention anything about high-traffic websites; it could just be capable of dealing with a few thousand records for all I know. Do I follow the concept of that post and pull reporting directly from that collection, or would it be better to pre-analyse the data and archive it into a separate collection?
Any thoughts on the data insertion, database structure and reporting would be hugely appreciated!
Well… it seems Facebook uses MySQL to a great degree. When it comes to NoSQL, I believe it’s not necessarily the technology that matters, it’s the data structures and algorithms.
What you are facing is potentially high write throughput. One approach that fits your problem well is sharding: no matter how big the machine and how efficient the software, there will be a limit on the number of writes a single machine can handle. Sharding splits the data across multiple servers, so you can write to different servers. For example, users A-M write to server 1, users N-Z to server 2.
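The A-M / N-Z example can be sketched as a small routing function. This is only an illustration of range-based routing in plain Python, not how MongoDB implements it; the server names and the fallback for usernames that do not start with a letter are assumptions.

```python
def shard_for_user(username: str) -> str:
    """Pick a server based on the first letter of the username."""
    first = username[:1].upper()
    if "A" <= first <= "M":
        return "server1"
    # N-Z goes to server 2. Assumption: anything else (digits, empty
    # names, non-ASCII letters) also falls through to server 2.
    return "server2"
```

Every write for a given user lands on exactly one server, which is what spreads the write load. It also shows the weak spot: if most usernames start with the same few letters, one server gets most of the traffic.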
Now, sharding comes at the cost of complexity, because it needs balancing, aggregations across all shards can be tricky, you need to maintain multiple independent databases, etc.
That’s a technology thing: MongoDB sharding is rather simple, because they support auto-sharding which does most of the nasty stuff for you. I don’t think you’ll need it at 500 inserts per second, but it’s good to know it’s there.
For the schema design, it’s important to think about the shard key, which will be used to determine which shard is responsible for the document. This might depend on your traffic patterns. Suppose you have a customer who operates a fair. Once a year, his website goes totally nuts, but for the other 360 days it is one of the lower-traffic sites. Now if you shard on your CustomerId, that particular customer might lead to problems, because all of his traffic ends up on a single shard. On the other hand, if you shard on VisitorId, you’ll have to hit each shard for a simple count().
The analysis part depends largely on the queries you want to support. The real deal slice & dice is rather challenging I’d say, in particular if you want to support near-real-time analytics. A much easier approach is to limit the user’s options and only provide a small set of operations. These can also be cached, so you won’t have to do all aggregations every time.
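One way to keep a small, fixed set of reports cheap is to maintain pre-aggregated counters at write time instead of aggregating raw log entries on every request. The sketch below only builds the filter and update documents for one counter per site, URL and hour, with the four 15-minute slots nested inside. The collection layout and the field names (`site_id`, `url`, `hour`, `total`, `quarters`) are assumptions made for illustration. The two documents are meant to be passed to something like PyMongo’s `update_one(filter, update, upsert=True)`.

```python
from datetime import datetime, timezone


def page_view_counter_update(site_id, url, ts: datetime):
    """Build (filter, update) documents for an upsert of an hourly counter.

    Hypothetical schema: one document per (site_id, url, hour), with a
    running total and one counter per 15-minute slot (0..3).
    """
    utc = ts.astimezone(timezone.utc)
    hour = utc.replace(minute=0, second=0, microsecond=0)
    slot = utc.minute // 15
    counter_filter = {"site_id": site_id, "url": url, "hour": hour}
    # $inc creates missing fields, so the same update works for the
    # first hit in an hour (as an upsert) and for every later one.
    counter_update = {"$inc": {"total": 1, f"quarters.{slot}": 1}}
    return counter_filter, counter_update
```

A graph for a given day then reads at most 24 small documents per URL rather than scanning every raw page view, at the price of one extra write per hit.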
In general, analytics can be tricky because there are many features that need relations. For example, cohort analysis will require you to consider only those log entries that were generated by a specific group of users. An $in query will do the trick for smaller cohorts, but if we’re talking about tens of thousands of users, it won’t do. You could select only a random subset of users, because that should be statistically sufficient, but of course it depends on your specific requirements.
For the analysis of large amounts of data, Map/Reduce comes in handy: it will do the processing on the server, and Map/Reduce also benefits from sharding, because the jobs can be processed individually by each shard. However, depending on a gazillion factors, these jobs will take some time.
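To illustrate why Map/Reduce fits sharding, here is a plain-Python sketch of the idea, not MongoDB’s actual Map/Reduce API. Each shard reduces its own log entries to partial counts, and only those small partial results are merged at the end. The entry fields (`site_id`, `username`) and the optional `cohort` filter, which plays the role of the $in restriction, are assumptions for the example.

```python
from collections import Counter
from functools import reduce


def map_entry(entry):
    """Map step: emit one (key, value) pair per log entry."""
    yield entry["site_id"], 1


def reduce_shard(entries, cohort=None):
    """Run map + reduce over one shard's entries, optionally limited to a cohort."""
    counts = Counter()
    for entry in entries:
        if cohort is not None and entry["username"] not in cohort:
            continue  # entry was not generated by a user in the cohort
        for key, value in map_entry(entry):
            counts[key] += value
    return counts


def merge(partials):
    """Final reduce: combine the per-shard partial counts."""
    return reduce(lambda a, b: a + b, partials, Counter())
```

With entries spread over several shards, `merge(reduce_shard(s) for s in shards)` gives page views per site. The per-shard calls are independent of each other, which is what lets every shard work on its own part of the job.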
I believe that the blog of Boxed Ice has some information on this; they definitely have experience in handling lots of analytical data using MongoDB.