This is a query that totals up every players game results from a game and displays the players who match the conditions.
select *,
(kills / deaths) as killdeathratio,
(totgames - wins) as losses
from (select gp.name as name,
gp.gameid as gameid,
gp.colour as colour,
Avg(dp.courierkills) as courierkills,
Avg(dp.raxkills) as raxkills,
Avg(dp.towerkills) as towerkills,
Avg(dp.assists) as assists,
Avg(dp.creepdenies) as creepdenies,
Avg(dp.creepkills) as creepkills,
Avg(dp.neutralkills) as neutralkills,
Avg(dp.deaths) as deaths,
Avg(dp.kills) as kills,
sc.score as totalscore,
Count(* ) as totgames,
Sum(case
when ((dg.winner = 1 and dp.newcolour < 6) or
(dg.winner = 2 and dp.newcolour > 6))
then 1
else 0
end) as wins
from gameplayers as gp,
dotagames as dg,
games as ga,
dotaplayers as dp,
scores as sc
where dg.winner <> 0
and dp.gameid = gp.gameid
and dg.gameid = dp.gameid
and dp.gameid = ga.id
and gp.gameid = dg.gameid
and gp.colour = dp.colour
and sc.name = gp.name
group by gp.name
having totgames >= 30
) as h
order by totalscore desc
Now I’m not too sure what’s the best way to go but what would in your opinion be to optimize this query?
I run a Q6600 @ 2.4ghz, 4gb of ram, 64-bit Linux Ubuntu 9.04 system and this query can take up to 6.7 seconds to run (I do have a huge database).
Also I would like to paginate the results as well and executing extra conditions on top of this query is far too slow….
I use django as a frontend so any methods that include using python +/- django methods would be great. MySQL, Apache2 tweaks are also welcome. And of course, I’m open to changing the query to make it run faster.
Thanks for reading my question; look forward to reading your answers!
Edit: EXPLAIN QUERY RESULTS
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 783 Using filesort
2 DERIVED sc ALL name,name_2 NULL NULL NULL 2099 Using temporary; Using filesort
2 DERIVED gp ref gameid,colour,name name 17 development.sc.name 2
2 DERIVED ga eq_ref PRIMARY,id,id_2 PRIMARY 4 development.gp.gameid 1 Using index
2 DERIVED dg ref gameid,winner gameid 4 development.ga.id 1 Using where
2 DERIVED dp ref gameid_2,colour gameid_2 4 development.ga.id 10 Using where
First of all, the SQL is badly formatted. The most obvious error is the line splitting before each
ASclause. Second obvious problem is using implicit joins instead of explicitly usingINNER JOIN ... ON ....Now to answer the actual question.
Without knowing the data or the environment, the first thing I’d look at would be some of the MySQL server settings, such as
sort_bufferandkey_buffer. If you haven’t changed any of these, go read up on them. The defaults are extremely conservative and can often be raised more than ten times their default, particularly on the large iron like you have.Having reviewed that, I’d be running pieces of the query to see speed and what
EXPLAINsays. The effect of indexing can be profound, but MySQL has a “fingers-and-toes” problem where it just can’t use more than one per table. AndJOINs with filtering can need two. So it has to descend to a rowscan for the other check. But having said that, dicing up the query and trying different combinations will show you where it starts stumbling.Now you will have an idea where a “tipping point” might be: this is where a small increase in some raw data size, like how much it needs to extract, will result in a big loss of performance as some internal structure gets too big. At this point, you will probably want to raise the temporary tables size. Beware that this kind of optimization is a bit of a black art. 🙂
However, there is another approach: denormalization. In a simple implementation, regularly scheduled scripts will run this expensive query from time-to-time and poke the data into a separate table in a structure much closer to what you want to display. There are multiple variations of this approach. It can be possible to keep this up-to-date on-the-fly, either in the application, or using table triggers. At the other extreme, you could allow your application to run the expensive query occasionally, but cache the result for a little while. This is most effective if a lot of people will call it often: even 2 seconds cache on a request that is run 15 times a second will show a visible improvement.
You could find ways of producing the same data by running half-a-dozen queries that each return some of the data, and post-processing the data. You could also run version of your original query that returns more data (which is likely to be much faster because it does less filtering) and post-process that. I have found several times that five simpler, smaller queries can be much faster – an order of magnitude, sometimes two – than one big query that is trying to do it all.