I have two tables: “servers” and “stats”.
servers has a column called “id” that auto-increments.
stats has a column called “server” that corresponds to a row in the servers table, a column called “time” that represents the time it was added, and a column called “votes” that I would like to get the average of.
I would like to fetch all the servers (SELECT * FROM servers) along with the average votes of the 24 most recent rows that correspond to each server. I believe this is a “greatest-n-per-group” question.
This is what I tried to do, but it gave me 24 rows total, not 24 rows per group:
SELECT servers.*,
       IFNULL(AVG(stats.votes), 0) AS avgvotes
FROM servers
LEFT OUTER JOIN
  (SELECT server,
          votes
   FROM stats
   GROUP BY server
   ORDER BY time DESC
   LIMIT 24) AS stats ON servers.id = stats.server
GROUP BY servers.id
Like I said, I would like to get the 24 most recent rows for each server, not 24 most recent rows total.
This is another approach.
This query is going to suffer the same performance problems as the other queries here that return correct results, because its execution plan requires a SORT operation on EVERY row in the stats table. Since there is no predicate (restriction) on the time column, EVERY row in the stats table will be considered. For a REALLY large stats table, this is going to blow out all available temporary space before it dies a horrible death. (More notes on performance below.)

What this query is doing is sorting the stats table by server and by descending order on the time column. (Inline view aliased as u.)

With the sorted result set, we assign row numbers 1, 2, 3, etc. to the rows for each server. (Inline view aliased as t.)

With that result set, we filter out any rows with a row number > 24, and we calculate an average of the votes column for the “latest” 24 rows for each server. (Inline view aliased as s.)

As a final step, we join that to the servers table, to return the requested resultset.
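The steps described above (number the rows per server newest-first, keep rows 1 to 24, average, then join to servers) can be sketched as follows. This is not the query the description refers to: it swaps in the ROW_NUMBER() window function for the numbering step, which assumes MySQL 8.0 or later, and it folds the separate sorting view u into the OVER clause. The aliases r and st are illustrative; t and s play the roles described above.

```sql
-- Sketch only; assumes MySQL 8.0+ (window functions) and the column
-- names given in the question (servers.id, stats.server/time/votes).
SELECT r.*,
       IFNULL(s.avgvotes, 0) AS avgvotes
FROM servers AS r
LEFT JOIN (
    -- s: average of votes over the latest 24 rows for each server
    SELECT t.server,
           AVG(t.votes) AS avgvotes
    FROM (
        -- t: number each server's rows 1, 2, 3, ... newest first
        SELECT st.server,
               st.votes,
               ROW_NUMBER() OVER (PARTITION BY st.server
                                  ORDER BY st.`time` DESC) AS num
        FROM stats AS st
    ) AS t
    WHERE t.num <= 24
    GROUP BY t.server
) AS s ON s.server = r.id;
```

The LEFT JOIN keeps servers that have no stats rows at all, and IFNULL turns their missing average into 0, matching the intent of the query in the question. The performance concern is unchanged: without a predicate on time, every row in stats still has to be ordered per server.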
NOTE:
The execution plan for this query will be COSTLY for a large number of rows in the stats table.

To improve performance, there are several approaches we could take.
The simplest might be to include in the query a predicate that EXCLUDES a significant number of rows from the stats table (e.g. rows with time values over 2 days old, or over 2 weeks old). That would significantly reduce the number of rows that need to be sorted to determine the “latest” 24 rows.

Also, with an index on stats(server,time), it’s possible that MySQL could do a relatively efficient “reverse scan” on the index, avoiding a sort operation.

We could also consider implementing an index on the stats table on (server,"reverse_time"). Since MySQL versions before 8.0 don’t support descending indexes, the implementation would really be a regular (ascending) index on a derived rtime value: a “reverse time” expression that is ascending for descending values of time, for example -1*UNIX_TIMESTAMP(my_timestamp) or -1*TIMESTAMPDIFF(SECOND,'1970-01-01',my_datetime).

Another approach to improve performance would be to keep a shadow table containing the most recent 24 rows for each server. That would be simplest to implement if we can guarantee that “latest rows” won’t be deleted from the stats table. We could maintain that table with a trigger. Basically, whenever a row is inserted into the stats table, we check whether the time on the new row is later than the earliest time stored for the server in the shadow table. If it is, we replace the earliest row in the shadow table with the new row, being sure to keep no more than 24 rows in the shadow table for each server.

And yet another approach is to write a procedure or function that gets the result. The approach here would be to loop through each server, run a separate query against the stats table to get the average votes for the latest 24 rows, and gather all of those results together. (That approach might really be more of a workaround to avoid a sort on a huge temporary set, just to enable the resultset to be returned, not necessarily making the return of the resultset blazingly fast.)

The bottom line for performance of this type of query on a LARGE table is restricting the number of rows considered by the query AND avoiding a sort operation on a large set. That’s how we get a query like this to perform.
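As a rough illustration of the first two suggestions (a restricting predicate plus an index on server and time), the sorting step could look something like the sketch below. The 2-day window comes from the example in the text; the index name stats_ix1 and the assumption that time is a DATETIME or TIMESTAMP column are mine, not confirmed by the question.

```sql
-- Assumption: `time` is a DATETIME or TIMESTAMP column.
-- stats_ix1 is an illustrative index name.
CREATE INDEX stats_ix1 ON stats (server, `time`);

-- Sorting step restricted to recent rows, so far fewer rows
-- need to be ordered before the "latest 24" are picked out.
EXPLAIN
SELECT st.server, st.`time`, st.votes
FROM stats AS st
WHERE st.`time` >= NOW() - INTERVAL 2 DAY
ORDER BY st.server DESC, st.`time` DESC;
```

Whether MySQL actually reads the index in order depends on the version and the data, so the Extra column of the EXPLAIN output is worth checking for "Using filesort". Note the trade-off of the predicate: a server with fewer than 24 rows inside the time window gets an average over only the rows that survive the filter, even if older rows exist.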
ADDENDUM
To get a “reverse index scan” operation (to get the rows from stats ordered using an index WITHOUT a filesort operation), I had to specify DESCENDING on both expressions in the ORDER BY clause. The query above previously had ORDER BY server ASC, time DESC, and MySQL always wanted to do a filesort, even when I specified the FORCE INDEX FOR ORDER BY (stats_ix1) hint.

If the requirement is to return an ‘average votes’ for a server only if there are at least 24 associated rows in the stats table, then we can make a more efficient query, even if it is a bit more messy. (Most of the messiness in the nested IF() functions is to deal with NULL values, which do not get included in the average. It can be much less messy if we have a guarantee that votes is NOT NULL, or if we exclude any rows where votes is NULL.)

With a covering index on stats(server,time,votes), the EXPLAIN showed MySQL avoided a filesort operation, so it must have used a “reverse index scan” to return the rows in order. Absent the covering index, with an index on (server,time), MySQL used the index if I included an index hint; with the FORCE INDEX FOR ORDER BY (stats_ix1) hint, MySQL avoided a filesort as well. (But since my table had fewer than 100 rows, I don’t think MySQL put much emphasis on avoiding a filesort operation.)

The time, votes, and avg_sofar expressions are commented out (in the inline view aliased as t); they aren’t needed, but they are useful for debugging.

The way that query stands, it needs at least 24 rows in stats for each server in order to return an average. (That may be acceptable.) But I was thinking that in general, we could return a running total so far (tot) and a running count (cnt).
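The running total (tot) and running count (cnt) idea can be illustrated with window functions. To be clear, this is a restatement under the assumption of MySQL 8.0 or later, not the nested-IF() query discussed above; the names num, tot and cnt follow the text, while the aliases r, st and the window name w are illustrative.

```sql
-- Sketch only; assumes MySQL 8.0+ (window functions).
SELECT r.*,
       t.tot / t.cnt AS avgvotes
FROM servers AS r
JOIN (
    SELECT st.server,
           ROW_NUMBER()    OVER w AS num,
           -- running sum and count of non-NULL votes, newest row first
           SUM(st.votes)   OVER (w ROWS UNBOUNDED PRECEDING) AS tot,
           COUNT(st.votes) OVER (w ROWS UNBOUNDED PRECEDING) AS cnt
    FROM stats AS st
    WINDOW w AS (PARTITION BY st.server ORDER BY st.`time` DESC)
) AS t ON t.server = r.id
WHERE t.num = 24;
```

Because SUM() and COUNT(expr) both skip NULLs, tot / cnt matches what AVG() would return over the same 24 rows. Only servers that have a 24th row appear in the result, which is the restriction described above.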
(If we replace the WHERE t.num = 24 with WHERE t.num <= 24, we can see the running average in action.)

To return the average where there aren’t at least 24 rows in stats, that’s really a matter of identifying the row (for each server) with the maximum value of num that is <= 24.
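One way to pick "the row with the maximum num that is <= 24" is to compare num against the smaller of 24 and the server's total row count. The sketch below again assumes MySQL 8.0 or later and uses window functions rather than the approach in the answer's own query; the column n (rows per server) and the aliases r, st and w are hypothetical names introduced for illustration.

```sql
-- Sketch only; assumes MySQL 8.0+ (window functions).
SELECT r.*,
       IFNULL(t.tot / t.cnt, 0) AS avgvotes
FROM servers AS r
LEFT JOIN (
    SELECT st.server,
           ROW_NUMBER()    OVER w AS num,
           COUNT(*)        OVER (PARTITION BY st.server) AS n,
           SUM(st.votes)   OVER (w ROWS UNBOUNDED PRECEDING) AS tot,
           COUNT(st.votes) OVER (w ROWS UNBOUNDED PRECEDING) AS cnt
    FROM stats AS st
    WINDOW w AS (PARTITION BY st.server ORDER BY st.`time` DESC)
) AS t ON t.server = r.id
      -- the highest row number not exceeding 24 for this server
      AND t.num = LEAST(24, t.n);
```

Each server matches at most one row of t, so servers with fewer than 24 stats rows get the average of whatever rows they have, and servers with none fall back to 0 through the LEFT JOIN and IFNULL.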