I’ve got a table which keeps track of article views. It has the following columns:
id, article_id, day, month, year, views_count.
Let’s say I want to keep track of daily views for every article. If I have 1,000 user-written articles, the number of rows would come to:
365 (1 year) * 1,000 => 365,000
Which is not too bad. But let’s say the number of articles grows to 1M, and three years pass. The number of rows would come to:
365 * 3 * 1,000,000 => 1,095,000,000
Obviously, this table will keep growing over time, and quite fast. What problems will this cause? Or should I not worry, since RDBMSs commonly handle situations like this?
I plan on using the views data in our reports, broken down by month or even by year. Should I worry about 1B+ rows in a table?
The question to ask yourself (or your stakeholders) is: do you really need 1-day resolution on older data?
Have a look at how products like MRTG, via RRD, do their logging. The idea is that you don’t store all the data at maximum resolution indefinitely, but regularly aggregate it into larger and larger summaries.
That allows you to have 1-second resolution for perhaps the last 5 minutes, then 5-minute averages for the last hour, then hourly for a day, daily for a month, and so on.
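As a rough illustration of that tiering, here is a minimal Python sketch that maps the age of a data point to the resolution you would still keep for it. The tier boundaries come from the example above; the `TIERS` table, the `resolution_for_age` helper, the 30-day "month", and the final "monthly" fallback are assumptions made for illustration, not something RRD itself prescribes.

```python
from datetime import timedelta

# (maximum age, resolution kept up to that age), following the example tiers.
# Treating a month as 30 days is an assumption for simplicity.
TIERS = [
    (timedelta(minutes=5), "1-second"),
    (timedelta(hours=1), "5-minute"),
    (timedelta(days=1), "hourly"),
    (timedelta(days=30), "daily"),
]


def resolution_for_age(age: timedelta) -> str:
    """Return the finest resolution retained for data of the given age."""
    for max_age, resolution in TIERS:
        if age <= max_age:
            return resolution
    # Assumed fallback standing in for the "and so on" tier.
    return "monthly"
```

For the article-views case only the last two tiers matter: keep daily rows for recent data and collapse anything older into monthly rows.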
So, for example, if you have a bunch of records like this for a single article:
You would then, at regular intervals, create new record(s) that summarise these data; in this example, just the total count for the month:
Or the average per day:
Of course, you may need some flag to indicate the “summarised” status of the data. In this case I’ve used a ‘type’ column to distinguish the “raw” records from the processed ones, allowing you to purge the day records as required.
(I haven’t tested that query; it’s just an example.)
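For a version of the roll-up you can run, here is a minimal sketch using Python’s built-in `sqlite3` module. The table name `article_views`, the ‘type’ values `'day'` and `'month'`, the `roll_up_month` helper, and the sample counts are all assumptions for illustration; only the column names come from the question. Adapt the SQL to your own RDBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE article_views (
        id          INTEGER PRIMARY KEY,
        article_id  INTEGER NOT NULL,
        day         INTEGER,           -- NULL on monthly summary rows
        month       INTEGER NOT NULL,
        year        INTEGER NOT NULL,
        views_count INTEGER NOT NULL,
        type        TEXT NOT NULL      -- assumed values: 'day' (raw) or 'month' (summary)
    )
    """
)

# Made-up daily counts for one article in January 2010.
conn.executemany(
    "INSERT INTO article_views (article_id, day, month, year, views_count, type) "
    "VALUES (?, ?, ?, ?, ?, 'day')",
    [(1, 1, 1, 2010, 10), (1, 2, 1, 2010, 20), (1, 3, 1, 2010, 30)],
)


def roll_up_month(conn, year, month):
    """Replace one month's raw daily rows with a single summary row per article."""
    with conn:  # single transaction: summarise, then purge the raw rows
        conn.execute(
            """
            INSERT INTO article_views
                (article_id, day, month, year, views_count, type)
            SELECT article_id, NULL, month, year, SUM(views_count), 'month'
            FROM article_views
            WHERE type = 'day' AND year = ? AND month = ?
            GROUP BY article_id, month, year
            """,
            (year, month),
        )
        conn.execute(
            "DELETE FROM article_views WHERE type = 'day' AND year = ? AND month = ?",
            (year, month),
        )


roll_up_month(conn, 2010, 1)
rows = conn.execute(
    "SELECT article_id, day, month, year, views_count, type FROM article_views"
).fetchall()
```

Storing the monthly total rather than the daily average keeps yearly sums exact; if you prefer the average, swap `SUM` for `AVG` and store it in a non-integer column.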