I am developing a website in which it is important to keep track of EVERY click and EVERY impression that EACH client makes.
So I have a database which includes (among others) two tables: “clicks” and “impressions”. The table “impressions” has the following structure:
ip int unsigned not null,
ts int unsigned not null,
main_post int unsigned not null,
side_post int unsigned not null,
PRIMARY KEY (ip,ts,main_post,side_post)
So there are few columns and they are all of the int type, so it is an efficient table. HOWEVER, what worries me is that this table will grow incredibly fast. With every request, five new rows will be added to this table, because there are always five side posts next to each main post. Plus, with every request, I will want to check this table to make sure I am not showing the same post again to the client.
The “clicks” table is similar, but less extreme (only one row gets added per request).
So my question is: will this be too much? Will these tables, after a few weeks or months of use, become too big to handle? And if yes, what is the best solution? Maybe starting a new table each week, or each month?
Thanks in advance
Answers to these and related questions will dictate what you need. However, if your site becomes extremely popular and you decide you really need long histories, then the table will become unmanageable. You have a table with 16 bytes per row (four 4-byte integers); you have an index on it that is likely to cost you 20-24 bytes per row (with a bit of overhead). That is about 40 bytes per row, and you add five rows per page, so for each page impression you are going to be using 200 bytes or so in your impressions table. At N pages per second, that comes to 200 × 86,400 × N bytes, or roughly 17×N MB/day; call it 20×N MiB/day to allow for overhead.
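To make the arithmetic above easy to rerun with your own traffic figures, here is a minimal Python sketch. The function name and its default values are assumptions for illustration: five rows per page comes from the question, 16 bytes of data per row from the schema, and 24 bytes of index per row is only the upper end of the rough estimate above, not a measured figure.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def impressions_bytes_per_day(pages_per_second, rows_per_page=5,
                              data_bytes=16, index_bytes=24):
    """Rough daily growth of the impressions table, in bytes.

    data_bytes: four 4-byte unsigned ints per row.
    index_bytes: assumed primary-key cost per row (a guess, not measured).
    """
    bytes_per_page = rows_per_page * (data_bytes + index_bytes)
    return pages_per_second * bytes_per_page * SECONDS_PER_DAY


if __name__ == "__main__":
    for n in (1, 10, 100):
        mib = impressions_bytes_per_day(n) / 2**20
        print(f"{n:>3} pages/s: about {mib:,.0f} MiB/day")
```

Real storage engines add page and row overhead on top of this, so treat the result as a lower bound rather than a prediction.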
I’m not clear how you are going to structure your queries against this table to ensure that the user is not shown the same material again. I don’t know whether IP is meant to be an IP address (have you heard of IPv6?) and TS a timestamp. I’m not convinced that an IP address is an appropriate way of tracking users (the same user may have multiple IP addresses over the course of a day – connecting from the office and from home, not to mention the coffee shop). I’m not sure that the PK index is going to help your queries very much.
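The IPv6 remark matters for the column type: an IPv4 address is 32 bits wide and fits in an int unsigned, while an IPv6 address is 128 bits wide and does not. The sketch below shows this with Python's standard ipaddress module; the helper name storage_bits and the sample addresses (taken from the documentation-only ranges) are just for illustration.

```python
import ipaddress


def storage_bits(address):
    """Return how many bits are needed to store this address as an integer."""
    return ipaddress.ip_address(address).max_prefixlen


if __name__ == "__main__":
    for text in ("192.0.2.10", "2001:db8::1"):
        ip = ipaddress.ip_address(text)
        fits = storage_bits(text) <= 32  # int unsigned is a 32-bit column
        print(f"{text}: {storage_bits(text)} bits, as integer {int(ip)}, "
              f"fits in int unsigned: {fits}")
```

If IPv6 clients need to be recorded, the ip column would have to be something wider than a 32-bit integer.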
When you know how you plan to use the data, then you can decide how to store it.
My strong suspicion is that you will find this design too onerous. The table is big enough that your queries are going to slow you down dramatically. Yes, I believe you will need to manage the table carefully, dropping old data routinely while retaining the most recent data.
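As a rough illustration of dropping old data routinely, here is a small Python sketch. It uses an in-memory SQLite database in place of your real server so that it runs on its own. It assumes that ts holds a Unix timestamp, that a 30-day retention window is acceptable, and that a separate index on ts exists (the primary key starts with ip, so it would not help a delete by age). None of those assumptions come from the question.

```python
import sqlite3
import time

RETENTION_SECONDS = 30 * 24 * 60 * 60  # hypothetical 30-day window


def prune_impressions(conn, now=None, retention=RETENTION_SECONDS):
    """Delete impressions older than the retention window; return rows removed."""
    now = int(time.time()) if now is None else now
    cur = conn.execute("DELETE FROM impressions WHERE ts < ?", (now - retention,))
    conn.commit()
    return cur.rowcount


def make_demo_db():
    """Build an in-memory table mirroring the schema in the question."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE impressions (
               ip INTEGER NOT NULL,
               ts INTEGER NOT NULL,
               main_post INTEGER NOT NULL,
               side_post INTEGER NOT NULL,
               PRIMARY KEY (ip, ts, main_post, side_post))"""
    )
    # Assumed extra index so that deleting by age does not scan the table.
    conn.execute("CREATE INDEX impressions_ts ON impressions (ts)")
    return conn


if __name__ == "__main__":
    now = 1_700_000_000
    conn = make_demo_db()
    conn.executemany(
        "INSERT INTO impressions VALUES (?, ?, ?, ?)",
        [(1, now - 40 * 86400, 10, 11),   # older than 30 days
         (1, now - 86400, 10, 12),        # one day old
         (2, now, 20, 21)],               # brand new
    )
    removed = prune_impressions(conn, now=now)
    kept = conn.execute("SELECT COUNT(*) FROM impressions").fetchone()[0]
    print(f"removed {removed} row(s), kept {kept}")
```

Run from a scheduled job, something along these lines keeps only the recent rows that the "have I shown this post already?" check needs.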