I’m developing a statistics module for my website that will help me measure conversion rates and other interesting data.
The mechanism I use is to store an entry in a statistics table each time a user enters a specific zone defined in my DB (I avoid duplicate records with the help of cookies); a rough sketch of the kind of table I mean follows the list below.
For example, I have the following zones:
- Website – a general zone used to count unique users, since I’ve stopped trusting Google Analytics lately.
- Category – self-descriptive.
- Minisite – self-descriptive.
- Product Image – counted whenever a user sees a product and the lead submission form.
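For reference, here is a minimal sketch of the kind of hit table I’m describing (the table and column names are just illustrative, and I’m using SQLite via Python only to keep the example self-contained):

```python
import sqlite3

conn = sqlite3.connect("stats.db")

# One row per unique visit to a zone; the cookie id is what lets me
# skip duplicate records for the same visitor (illustrative schema).
conn.execute("""
    CREATE TABLE IF NOT EXISTS zone_hits (
        id        INTEGER PRIMARY KEY,
        zone      TEXT NOT NULL,   -- 'Website', 'Category', 'Minisite', 'ProductImage'
        item_id   INTEGER,         -- the category/minisite/product involved, if any
        cookie_id TEXT NOT NULL,   -- unique visitor token stored in the cookie
        hit_time  TEXT NOT NULL    -- ISO timestamp of the visit
    )
""")
conn.commit()
```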
The problem is that after a month my statistics table is packed with rows, and the ASP.NET pages I wrote to parse the data load really slowly.
I thought about writing a service that would somehow pre-parse the data, but I can’t see any way to do that without losing flexibility.
My questions:
- How do large-scale data-parsing applications like Google Analytics load the data so fast?
- What is the best way for me to do it?
- Or is my DB design wrong, and should I store the data in only one table?
Thanks to anyone who helps,
Eytan.
The basic approach you’re looking for is called aggregation.
You are interested in certain functions calculated over your data, and instead of computing them ‘online’ when the displaying page loads, you compute them offline, either via a nightly batch process or incrementally whenever a log record is written.
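As a sketch of the nightly-batch variant (assuming a raw per-hit table like the hypothetical zone_hits one above, and an illustrative daily_zone_stats summary table; none of these names are prescriptive):

```python
import sqlite3

conn = sqlite3.connect("stats.db")

# Roll the raw per-hit log up into one row per (day, zone).
# The report pages then read daily_zone_stats, which stays small,
# instead of scanning the ever-growing zone_hits table.
conn.execute("""
    CREATE TABLE IF NOT EXISTS daily_zone_stats (
        day       TEXT NOT NULL,
        zone      TEXT NOT NULL,
        hit_count INTEGER NOT NULL,
        PRIMARY KEY (day, zone)
    )
""")
conn.execute("""
    INSERT OR REPLACE INTO daily_zone_stats (day, zone, hit_count)
    SELECT date(hit_time), zone, COUNT(*)
    FROM zone_hits
    GROUP BY date(hit_time), zone
""")
conn.commit()
```

Because the summary holds one row per (day, zone) instead of one per hit, the report queries stay fast no matter how large the raw log grows.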
A simple enhancement would be to store counts per user/session instead of storing every hit and counting them later. That would reduce your analytic processing load by a factor on the order of the number of hits per session. Of course, it would increase the processing cost of inserting log entries.
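A sketch of that idea, again with hypothetical names: each hit bumps a single counter row for the visitor/zone pair instead of appending a new log row (the upsert syntax needs SQLite 3.24+):

```python
import sqlite3

conn = sqlite3.connect("stats.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS session_counts (
        cookie_id TEXT NOT NULL,
        zone      TEXT NOT NULL,
        hits      INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (cookie_id, zone)
    )
""")

def record_hit(cookie_id: str, zone: str) -> None:
    # Upsert: one row per visitor/zone, incremented on every hit,
    # instead of one new log row per hit.
    conn.execute("""
        INSERT INTO session_counts (cookie_id, zone, hits)
        VALUES (?, ?, 1)
        ON CONFLICT (cookie_id, zone) DO UPDATE SET hits = hits + 1
    """, (cookie_id, zone))
    conn.commit()
```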
Another kind of aggregation is online analytical processing (OLAP), which pre-aggregates along only some dimensions of your data and lets users aggregate the remaining dimensions in a browsing mode. This trades off performance, storage, and flexibility.
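A toy illustration: with cells pre-aggregated along the day and zone dimensions (as in the daily_zone_stats sketch above), the browsing UI can roll them up further along whichever dimension the user picks, without ever touching the raw log:

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect("stats.db")

# Pre-aggregated cells along two dimensions: day and zone
# (built, e.g., by the nightly job sketched earlier).
cells = conn.execute(
    "SELECT day, zone, hit_count FROM daily_zone_stats"
).fetchall()

# The browsing UI can collapse either remaining dimension on demand:
hits_per_zone = Counter()   # roll up over days  -> totals by zone
hits_per_day = Counter()    # roll up over zones -> totals by day
for day, zone, count in cells:
    hits_per_zone[zone] += count
    hits_per_day[day] += count
```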