I am trying to query against a partitioned table (by month) approaching 20M rows. I need to group by DATE(transaction_utc) as well as country_id. The rows that get returned if i turn off the group by and aggregates is just over 40k, which isn’t too many, however adding the group by makes the query substantially slower unless said GROUP BY is on the transaction_utc column, in which case it gets FAST.
I’ve been trying to optimize this first query below by tweaking the query and/or the indexes, and got to the point below (about 2x as fast as initially) however still stuck with a 5s query for summarizing 45k rows, which seems way too much.
For reference, this box is a brand new 24 logical core, 64GB RAM, Mariadb-5.5.x server with way more INNODB buffer pool available than index space on the server, so shouldn’t be any RAM or CPU pressures.
So, I’m looking for ideas on what is causing this slow down and suggestions on speeding it up. Any feedback would be greatly appreciated! 🙂
Ok, onto the details…
The following query (the one I actually need) takes approx 5 seconds (+/-), and returns less than 100 rows.
SELECT lss.`country_id` AS CountryId
, Date(lss.`transaction_utc`) AS TransactionDate
, c.`name` AS CountryName, lss.`country_id` AS CountryId
, COALESCE(SUM(lss.`sale_usd`),0) AS SaleUSD
, COALESCE(SUM(lss.`commission_usd`),0) AS CommissionUSD
FROM `sales` lss
JOIN `countries` c ON lss.`country_id` = c.`country_id`
WHERE ( lss.`transaction_utc` BETWEEN '2012-09-26' AND '2012-10-26' AND lss.`username` = 'someuser' ) GROUP BY lss.`country_id`, DATE(lss.`transaction_utc`)
EXPLAIN SELECT for the same query is as follows. Notice that it’s not using the transaction_utc key. Shouldn’t it be using my covering index instead?
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE lss ref idx_unique,transaction_utc,country_id idx_unique 50 const 1208802 Using where; Using temporary; Using filesort
1 SIMPLE c eq_ref PRIMARY PRIMARY 4 georiot.lss.country_id 1
Now onto a couple other options that I’ve tried to attempt to determine whats going on…
The following query (changed group by) takes about 5 seconds (+/-), and returns only 3 rows:
SELECT lss.`country_id` AS CountryId
, DATE(lss.`transaction_utc`) AS TransactionDate
, c.`name` AS CountryName, lss.`country_id` AS CountryId
, COALESCE(SUM(lss.`sale_usd`),0) AS SaleUSD
, COALESCE(SUM(lss.`commission_usd`),0) AS CommissionUSD
FROM `sales` lss
JOIN `countries` c ON lss.`country_id` = c.`country_id`
WHERE ( lss.`transaction_utc` BETWEEN '2012-09-26' AND '2012-10-26' AND lss.`username` = 'someuser' ) GROUP BY lss.`country_id`
The following query (removed group by) takes 4-5 seconds (+/-) and returns 1 row:
SELECT lss.`country_id` AS CountryId
, DATE(lss.`transaction_utc`) AS TransactionDate
, c.`name` AS CountryName, lss.`country_id` AS CountryId
, COALESCE(SUM(lss.`sale_usd`),0) AS SaleUSD
, COALESCE(SUM(lss.`commission_usd`),0) AS CommissionUSD
FROM `sales` lss
JOIN `countries` c ON lss.`country_id` = c.`country_id`
WHERE ( lss.`transaction_utc` BETWEEN '2012-09-26' AND '2012-10-26' AND lss.`username` = 'someuser' )
The following query takes .00X seconds (+/-) and returns ~45k rows. This to me shows that at max we’re only trying to group 45K rows into less than 100 groups (as in my initial query):
SELECT lss.`country_id` AS CountryId
, DATE(lss.`transaction_utc`) AS TransactionDate
, c.`name` AS CountryName, lss.`country_id` AS CountryId
, COALESCE(SUM(lss.`sale_usd`),0) AS SaleUSD
, COALESCE(SUM(lss.`commission_usd`),0) AS CommissionUSD
FROM `sales` lss
JOIN `countries` c ON lss.`country_id` = c.`country_id`
WHERE ( lss.`transaction_utc` BETWEEN '2012-09-26' AND '2012-10-26' AND lss.`username` = 'someuser' )
GROUP BY lss.`transaction_utc`
TABLE SCHEMA:
CREATE TABLE IF NOT EXISTS `sales` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`user_linkshare_account_id` int(11) unsigned NOT NULL,
`username` varchar(16) NOT NULL,
`country_id` int(4) unsigned NOT NULL,
`order` varchar(16) NOT NULL,
`raw_tracking_code` varchar(255) DEFAULT NULL,
`transaction_utc` datetime NOT NULL,
`processed_utc` datetime NOT NULL ,
`sku` varchar(16) NOT NULL,
`sale_original` decimal(10,4) NOT NULL,
`sale_usd` decimal(10,4) NOT NULL,
`quantity` int(11) NOT NULL,
`commission_original` decimal(10,4) NOT NULL,
`commission_usd` decimal(10,4) NOT NULL,
`original_currency` char(3) NOT NULL,
PRIMARY KEY (`id`,`transaction_utc`),
UNIQUE KEY `idx_unique` (`username`,`order`,`processed_utc`,`sku`,`transaction_utc`),
KEY `raw_tracking_code` (`raw_tracking_code`),
KEY `idx_usd_amounts` (`sale_usd`,`commission_usd`),
KEY `idx_countries` (`country_id`),
KEY `transaction_utc` (`transaction_utc`,`username`,`country_id`,`sale_usd`,`commission_usd`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
/*!50100 PARTITION BY RANGE ( TO_DAYS(`transaction_utc`))
(PARTITION pOLD VALUES LESS THAN (735112) ENGINE = InnoDB,
PARTITION p201209 VALUES LESS THAN (735142) ENGINE = InnoDB,
PARTITION p201210 VALUES LESS THAN (735173) ENGINE = InnoDB,
PARTITION p201211 VALUES LESS THAN (735203) ENGINE = InnoDB,
PARTITION p201212 VALUES LESS THAN (735234) ENGINE = InnoDB,
PARTITION pMAX VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */ AUTO_INCREMENT=19696320 ;
The offending part is probably the
GROUP BY DATE(transaction_utc). You also claim to have a covering index for this query but I see none. Your 5-column index has all the columns used in the query but not in the best order (which is:WHERE–GROUP BY–SELECT).So, the engine, finding no useful index, would have to evaluate this function for all the 20M rows. Actually, it finds an index that starts with
username(theidx_unique) and it uses that, so it has to evaluate the function for (only) 1.2M rows. If you had a(transaction_utc)or a(username, transaction_utc)it would choose the most useful of the three.Can you afford to change the table structure by splitting the column into date and time parts?
If you can, then an index on
(username, country_id, transaction_date)or (changing the order of the two columns used for grouping), on(username, transaction_date, country_id)would be quite efficient.A covering index on
(username, country_id, transaction_date, sale_usd, commission_usd)even better.If you want to keep the current structure, try changing the order inside your 5-column index to:
or to:
Since you are using MariaDB, you can use the
VIRTUALcolumns feature, without changing the existing columns:Add a virtual (persistent) column and the appropriate index: