I’ve been trying for some time to write a query that counts, per day, all rows in a table that have a certain id in one column, and then groups those counts into weekly values based on the UNIX timestamp column. I have a medium-sized dataset with 37 million rows, and have been trying to run the following kind of query:
SELECT DATE(timestamp), COUNT(*)
FROM `table`
WHERE date(timestamp) BETWEEN "YYYY-MM-DD" AND "YYYY-MM-DD"
  AND column_group_id = X
GROUP BY week(date(startdate))
However, I’m getting weird results: the query doesn’t group the counts correctly and shows values that are too large in the count column (I verified the errors by querying very small, specific datasets).
If I group by date(startdate) instead, the row counts match on a per-day basis, but I’d like to combine these daily row counts into weekly totals. How can this be done? The data is needed in this format:
2006-01-01 | 5
2006-01-08 | 10
so that the first column is the day and the second is the number of rows per week.
Your query is non-deterministic, so it is not surprising that you are getting unexpected results. By this I mean you could run this query on the same data 5 times and get 5 different result sets. This is because you are selecting DATE(timestamp) but grouping by WEEK(DATE(startdate)), so for each startdate week the query returns the timestamp of whichever row it happens to come across first, in ANY order.
Consider the following 2 rows (with timestamp in date format for ease of reading):
Your query is grouping by WEEK(StartDate), which is 23 for both rows. Since both rows evaluate to the same value, you would expect your results to have 1 row with a count of 2. HOWEVER, DATE(Timestamp) is also in the select list, and since it is neither aggregated nor part of the GROUP BY, the query has no way of knowing which Timestamp to return, ‘20120601’ or ‘20120701’. So even on this small result set you have a 50:50 chance of getting:
20120601 | 2
and a 50:50 chance of getting:
20120701 | 2
If you add more data to the dataset like so:
You could get
or
You can see how with 37,000,000 rows you will soon get results that you do not expect and cannot predict!
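The two-row case described above can be reproduced with a small sketch. The table name `demo` and the two startdate values are hypothetical, chosen only so that both rows fall into week 23 under MySQL's default week mode; they are not the original example data.

```sql
-- Hypothetical table and values, for illustration only.
CREATE TABLE demo (
    `timestamp` DATETIME NOT NULL,
    startdate   DATE     NOT NULL
);

INSERT INTO demo (`timestamp`, startdate) VALUES
    ('2012-06-01 00:00:00', '2012-06-04'),
    ('2012-07-01 00:00:00', '2012-06-05');

-- Non-deterministic: DATE(`timestamp`) is neither grouped nor aggregated,
-- so the server may pick the value from either row in the group.
-- With ONLY_FULL_GROUP_BY enabled (the default since MySQL 5.7.5),
-- this statement is rejected with an error instead.
SELECT DATE(`timestamp`), COUNT(*)
FROM demo
GROUP BY WEEK(startdate);

-- Deterministic: state explicitly which timestamp to show per group.
SELECT MIN(DATE(`timestamp`)) AS first_seen, COUNT(*) AS row_count
FROM demo
GROUP BY WEEK(startdate);
```

Wrapping the non-grouped column in an aggregate such as MIN() is one way to make the result predictable; grouping by the same expression that is selected is another.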
EDIT
Since it looks like you are trying to get the week start in your results while grouping by week, you could use the following to get the week start (replacing CURRENT_TIMESTAMP with whichever column you want):
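One way to write such an expression in MySQL, as a sketch that assumes weeks start on Sunday (which matches the 2006-01-01 and 2006-01-08 dates in the question). The alias `week_start` is a name chosen here for illustration.

```sql
-- DAYOFWEEK() returns 1 for Sunday through 7 for Saturday, so subtracting
-- (DAYOFWEEK - 1) days moves any date back to the Sunday of its week.
SELECT DATE_SUB(DATE(CURRENT_TIMESTAMP),
                INTERVAL (DAYOFWEEK(CURRENT_TIMESTAMP) - 1) DAY) AS week_start;

-- Variant for weeks starting on Monday: WEEKDAY() returns 0 for Monday
-- through 6 for Sunday.
SELECT DATE_SUB(DATE(CURRENT_TIMESTAMP),
                INTERVAL WEEKDAY(CURRENT_TIMESTAMP) DAY) AS week_start;
```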
You can then group by this date too to get weekly results and avoid the trouble of having things in your select list that aren’t in your group by.
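Put together, the weekly count might look like the sketch below. It reuses the table and column names from the question; the date range and the id value 1 are placeholder values, and it assumes `timestamp` is a DATETIME or TIMESTAMP column (if it holds an integer UNIX timestamp, it would probably need to be wrapped in FROM_UNIXTIME() first).

```sql
SELECT
    -- Sunday of the week each row falls in (assumes Sunday-based weeks).
    DATE_SUB(DATE(`timestamp`),
             INTERVAL (DAYOFWEEK(`timestamp`) - 1) DAY) AS week_start,
    COUNT(*) AS row_count
FROM `table`
WHERE DATE(`timestamp`) BETWEEN '2006-01-01' AND '2006-01-31'  -- placeholder range
  AND column_group_id = 1                                      -- placeholder id
GROUP BY week_start   -- same expression as in the select list
ORDER BY week_start;
```

Because the selected date and the grouping key are the same expression, every row in a group yields the same week_start, so the result no longer depends on which row the server reads first.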