I’m working on a web app to display some analytics data from a MySQL database table. I expect to collect data from about 10,000 users at most. This table is going to have millions of records per user.
I’m considering giving each user their own table, but more importantly I want to figure out how to optimize data retrieval.
I get data from the database table using a series of SELECT COUNT queries for a particular day. An example is below:
SELECT * FROM
(SELECT COUNT(id) AS data_point_1 FROM my_table WHERE customer_id = '1' AND datetime_added LIKE '2013-01-20%' AND status_id = '1') AS col_1
CROSS JOIN
(SELECT COUNT(id) AS data_point_2 FROM my_table WHERE customer_id = '1' AND datetime_added LIKE '2013-01-20%' AND status_id = '0') AS col_2
CROSS JOIN ...
When I want to retrieve data from the last 30 days, the query will be 30 times as long as it is above; 60 days likewise, etc. The user will have the ability to select the number of days e.g. 30, 60, 90, and a custom range.
I need the data for a time series chart. Just to be clear, data for each day could range from thousands of records to millions.
My question is:
- Is this the most performant way of retrieving this data, or is there a better way to get all the time series data I need in one SQL query? How is this going to work when a user needs data from the last 2 years, i.e. a MySQL query that is potentially over a thousand lines long?
- Should I consider caching the retrieved data (using memcache, for example) for extended periods of time, e.g. an hour or more, to reduce server load? Being analytics data, it really should be real-time, but I’m afraid of overloading the server with queries for the same data even when there are no changes.
Any assistance would be appreciated.
First, you should not put each user in a separate table. You have other options that are not nearly as intrusive on your application.
You should consider partitioning the data. Based on what you say, I would have one partition by time (by day, week, or month) and an index on the users. Your query should probably look more like:
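A minimal sketch of that, assuming the my_table columns from the question, a DATETIME datetime_added, and an integer status_id. The column types, partition names, and index name are illustrative assumptions, not the original schema. The first statement shows one possible monthly-partitioned layout, the second a grouped query that returns one row per day and status instead of one subquery per data point:

```sql
-- Hypothetical layout: monthly RANGE partitions plus an index led by the user.
-- MySQL requires the partitioning column to be part of every unique key,
-- which is why datetime_added appears in the primary key.
CREATE TABLE my_table (
    id             BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    customer_id    INT UNSIGNED    NOT NULL,
    status_id      TINYINT         NOT NULL,
    datetime_added DATETIME        NOT NULL,
    PRIMARY KEY (id, datetime_added),
    KEY idx_customer_day (customer_id, datetime_added, status_id)
)
PARTITION BY RANGE (TO_DAYS(datetime_added)) (
    PARTITION p2013_01 VALUES LESS THAN (TO_DAYS('2013-02-01')),
    PARTITION p2013_02 VALUES LESS THAN (TO_DAYS('2013-03-01')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);

-- One row per (day, status) for a single customer over a date range.
-- The half-open range on datetime_added replaces LIKE '2013-01-20%',
-- so the index and partition pruning can be used.
SELECT DATE(datetime_added) AS day,
       status_id,
       COUNT(*)             AS data_point
FROM my_table
WHERE customer_id = 1
  AND datetime_added >= '2013-01-01'
  AND datetime_added <  '2013-02-01'
GROUP BY DATE(datetime_added), status_id
ORDER BY day, status_id;
```

The length of the query no longer depends on the number of days: a 30-day, 90-day, or 2-year chart only changes the two date bounds.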
You can then pivot this, either in an outer query or in an application.
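As a rough sketch of the outer-query variant, assuming the same my_table columns as in the question and that status_id takes the integer values 1 and 0, conditional aggregation can turn the per-day, per-status counts into one row per day with one column per data point:

```sql
-- Inner query: one row per (day, status) with its count.
-- Outer query: pivot the status rows into columns named as in the question.
SELECT day,
       SUM(CASE WHEN status_id = 1 THEN data_point ELSE 0 END) AS data_point_1,
       SUM(CASE WHEN status_id = 0 THEN data_point ELSE 0 END) AS data_point_2
FROM (
    SELECT DATE(datetime_added) AS day,
           status_id,
           COUNT(*)             AS data_point
    FROM my_table
    WHERE customer_id = 1
      AND datetime_added >= '2013-01-01'
      AND datetime_added <  '2013-02-01'
    GROUP BY DATE(datetime_added), status_id
) AS daily
GROUP BY day
ORDER BY day;
```

Days with no rows at all will not appear in the result, so the charting code may still need to fill those gaps with zeros.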
I would also suggest that you summarize the data on a daily basis, so your analyses can run on the summarized tables. This will make things go much faster.
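One way this could look, as a sketch only: daily_summary and its columns are hypothetical names, and the refresh statement is assumed to run once a day from a scheduled job after midnight.

```sql
-- Hypothetical summary table: one row per customer, day, and status.
CREATE TABLE daily_summary (
    customer_id INT UNSIGNED NOT NULL,
    day         DATE         NOT NULL,
    status_id   TINYINT      NOT NULL,
    data_point  INT UNSIGNED NOT NULL,
    PRIMARY KEY (customer_id, day, status_id)
);

-- Summarize yesterday's rows. Re-running the statement overwrites the
-- existing counts instead of failing on the primary key.
-- (VALUES() in this position is deprecated in recent MySQL 8.0 releases
-- but still accepted.)
INSERT INTO daily_summary (customer_id, day, status_id, data_point)
SELECT customer_id,
       DATE(datetime_added),
       status_id,
       COUNT(*)
FROM my_table
WHERE datetime_added >= CURDATE() - INTERVAL 1 DAY
  AND datetime_added <  CURDATE()
GROUP BY customer_id, DATE(datetime_added), status_id
ON DUPLICATE KEY UPDATE data_point = VALUES(data_point);

-- The chart then reads a few dozen small rows instead of millions.
SELECT day, status_id, data_point
FROM daily_summary
WHERE customer_id = 1
  AND day >= CURDATE() - INTERVAL 30 DAY
ORDER BY day, status_id;
```

If the current day must be close to real-time, the summarized days can be combined with a live count over today's rows only, which keeps the expensive scan limited to a single day.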