I have this query which only runs once per request.
SELECT SUM(numberColumn) AS total, groupColumn
FROM myTable
WHERE dateColumn < ? AND categoryColumn = ?
GROUP BY groupColumn
HAVING total > 0
myTable has fewer than a dozen columns and can grow to 5 million rows, though about 2 million is more likely in production. All columns used in the query are numeric except for dateColumn, and there are indexes on dateColumn and categoryColumn.
Would it be reasonable to expect this query to run in under 5 seconds with 5 million rows on most modern servers if the database is properly optimized?
The reason I’m asking is that we don’t have 5 million rows of data and we won’t even reach 2 million within the next few years. If the query doesn’t run in under 5 seconds by then, it will be hard to know where the problem lies. Would it be because the query is not suitable for a large table, or because the database isn’t optimized, or because the server isn’t powerful enough? Basically, I’d like to know whether using SUM() and GROUP BY over a large table is reasonable.
Thanks.
As people suggested in the comments under your question, the easiest way to verify this is to generate random data and measure the query execution time. Note that a clustered index on dateColumn can change execution times significantly, because with a “<” condition only a contiguous range of rows on disk has to be read to calculate the sums.
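The random-data test could look something like the sketch below. It is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for your real engine, the value ranges, the seed, and the helper names build_test_db and time_query are made up, and dates are stored as ISO text. The HAVING clause repeats SUM(numberColumn) instead of the alias so that it stays portable. Timings from SQLite will not match your production database, so treat this as a harness to adapt rather than a benchmark.

```python
import random
import sqlite3
import time
from datetime import date, timedelta

# Same query as in the question; HAVING repeats the aggregate for portability.
QUERY = """
SELECT SUM(numberColumn) AS total, groupColumn
FROM myTable
WHERE dateColumn < ? AND categoryColumn = ?
GROUP BY groupColumn
HAVING SUM(numberColumn) > 0
"""


def build_test_db(n_rows, seed=0):
    """Create an in-memory table filled with n_rows random rows (assumed ranges)."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE myTable ("
        "numberColumn INTEGER, groupColumn INTEGER, "
        "categoryColumn INTEGER, dateColumn TEXT)"
    )
    start = date(2020, 1, 1)
    rows = (
        (
            rng.randint(-100, 1000),  # numberColumn
            rng.randint(1, 50),       # groupColumn: 50 hypothetical groups
            rng.randint(1, 10),       # categoryColumn: 10 hypothetical categories
            (start + timedelta(days=rng.randint(0, 1460))).isoformat(),
        )
        for _ in range(n_rows)
    )
    conn.executemany("INSERT INTO myTable VALUES (?, ?, ?, ?)", rows)
    # The two single-column indexes mentioned in the question.
    conn.execute("CREATE INDEX idx_date ON myTable (dateColumn)")
    conn.execute("CREATE INDEX idx_category ON myTable (categoryColumn)")
    conn.commit()
    return conn


def time_query(conn, cutoff, category):
    """Run the query once and return (result rows, elapsed seconds)."""
    t0 = time.perf_counter()
    rows = conn.execute(QUERY, (cutoff, category)).fetchall()
    return rows, time.perf_counter() - t0


if __name__ == "__main__":
    conn = build_test_db(200_000)  # raise towards 5_000_000 for the real test
    rows, elapsed = time_query(conn, "2023-01-01", 3)
    print(f"{len(rows)} groups in {elapsed:.3f} s")
```

Running the same harness at several row counts gives a rough idea of how the time grows, which helps separate a query problem from a server problem.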
If you are at the beginning of development, I’d suggest concentrating not on the structure of the table and indexes that collect the data, but rather on what you expect to need to retrieve from the table in the future. I can share my own experience with presenting web usage statistics to a website administrator. I had several web pages being requested from the server, each of them falling into one or more “categories”. My first approach was to collect each request in a log table with some indexes, but the table grew much larger than I had first estimated. 🙂 Because the statistics were always analyzed in fixed groups (weekly, monthly, and yearly), I decided to create an additional table that aggregated requests into predefined week/month/year groups. Each request incremented the relevant columns, and the columns referred to my “categories”. This broke some normalization rules, but allowed me to calculate statistics in the blink of an eye.
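A minimal sketch of that kind of aggregate table follows. It is not my original code: the table name request_stats, the category names in CATEGORIES, and the helper functions are all hypothetical, and it again assumes Python's sqlite3 module as a stand-in database. Each period (week, month, year) gets one row, and each category gets its own counter column that is incremented per request.

```python
import sqlite3
from datetime import date, timedelta

CATEGORIES = ("news", "shop", "docs")  # hypothetical category names


def create_stats_table(conn):
    """One row per (period_type, period_start), one counter column per category."""
    cols = ", ".join(f"{c}_hits INTEGER NOT NULL DEFAULT 0" for c in CATEGORIES)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS request_stats ("
        "period_type TEXT NOT NULL, period_start TEXT NOT NULL, "
        f"{cols}, PRIMARY KEY (period_type, period_start))"
    )


def period_starts(day):
    """First day of the week (Monday), month, and year that contain `day`."""
    return {
        "week": day - timedelta(days=day.weekday()),
        "month": day.replace(day=1),
        "year": day.replace(month=1, day=1),
    }


def record_request(conn, day, categories):
    """Increment the counters of every period row that `day` falls into."""
    for c in categories:
        if c not in CATEGORIES:  # column names are only taken from the whitelist
            raise ValueError(f"unknown category: {c}")
    for ptype, start in period_starts(day).items():
        key = (ptype, start.isoformat())
        # Create the period row on first use, then bump the counters.
        conn.execute(
            "INSERT OR IGNORE INTO request_stats (period_type, period_start) "
            "VALUES (?, ?)",
            key,
        )
        for c in categories:
            conn.execute(
                f"UPDATE request_stats SET {c}_hits = {c}_hits + 1 "
                "WHERE period_type = ? AND period_start = ?",
                key,
            )
    conn.commit()


def hits(conn, ptype, start, category):
    """Read one precomputed counter; a period with no requests counts as 0."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    row = conn.execute(
        f"SELECT {category}_hits FROM request_stats "
        "WHERE period_type = ? AND period_start = ?",
        (ptype, start.isoformat()),
    ).fetchone()
    return row[0] if row else 0
```

Reading a statistic is then a primary-key lookup on a table with a few hundred rows instead of a SUM() over millions of log rows. The cost is a few extra writes per request and a schema change whenever a category is added.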