I’m having an issue with some databases/queries I inherited. This is for some large datasets and reporting being done on them.
I’m trying to tweak and tune to get some improvements.
What’s happening is I’m not 100% clear on how MySQL is deciding which index to use.
Why is the first query listed below not using the index that gets used in query 2? In query 2 I’m doing what I assume the query engine should be doing: take the small table, get the appropriate values, then apply them to searching the bigger table, utilizing the appropriate index.
What am I doing wrong here? Or rather, what am I misunderstanding about how foreign keys, indexes, and joins work here 🙂
Here are the 2 relevant tables
Table 1
~450 rows
CREATE TABLE `client_accounts_dim` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`client_id` int(10) unsigned NOT NULL,
`service_provider_id` int(10) unsigned NOT NULL,
`account_number` varchar(45) NOT NULL,
`label` varchar(128) DEFAULT NULL,
`service_provider_name` varchar(45) NOT NULL,
`client_name` varchar(45) NOT NULL,
PRIMARY KEY (`id`),
KEY `client_id` (`client_id`,`account_number`)
) ENGINE=InnoDB;
Table 2
~11,000,000 rows
CREATE TABLE `invoices_fact` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`invoice_number` varchar(45) NOT NULL COMMENT ' ',
...
...
`tracking_number` varchar(45) DEFAULT NULL,
`division_id` int(11) DEFAULT NULL,
`client_accounts_dim_id` int(10) unsigned NOT NULL,
`invoice_date_dim_id` bigint(20) DEFAULT NULL,
`shipment_date_dim_id` bigint(20) NOT NULL,
`received_date_dim_id` bigint(20) NOT NULL,
PRIMARY KEY (`id`),
KEY `fk_invoice_details_client_accounts_dim1_idx` (`client_accounts_dim_id`),
KEY `invoice_date_dim_id` (`invoice_date_dim_id`),
KEY `shipment_date_dim_id` (`shipment_date_dim_id`,`client_accounts_dim_id`,`division_id`,`tracking_number`),
CONSTRAINT `fk_invoice_details_client_accounts_dim1` FOREIGN KEY (`client_accounts_dim_id`) REFERENCES `client_accounts_dim` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB;
First query where I do a basic join
SELECT count(distinct tracking_number) as val, p.division_id as division_id
FROM client_accounts_dim c, invoices_fact p
WHERE c.id = p.client_accounts_dim_id
AND p.division_id IN (2,3,7)
AND c.client_id = 17
AND p.shipment_date_dim_id between 20120101 and 20121108
GROUP BY p.division_id;
Runs in 28s
Explain yields
+----+-------------+-------+------+------------------------------------------------------------------+---------------------------------------------+---------+---------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+------------------------------------------------------------------+---------------------------------------------+---------+---------+------+-------------+
| 1 | SIMPLE | c | ref | PRIMARY,client_id | client_id | 4 | const | 49 | Using index |
| 1 | SIMPLE | p | ref | fk_package_details_client_accounts_dim1_idx,shipment_date_dim_id | fk_package_details_client_accounts_dim1_idx | 4 | c.id | 913 | Using where |
+----+-------------+-------+------+------------------------------------------------------------------+---------------------------------------------+---------+---------+------+-------------+
Query where I do the join “manually” by running a query first, then putting the client_accounts_dim_ids in.
SELECT count(distinct tracking_number) as val, p.division_id as division_id
FROM invoices_fact p
WHERE division_id in (2,3,7)
AND p.client_accounts_dim_id IN ( 232, 233, 234, 277, 235, 236, 279, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 278, 280, 262, 263, 264, 252, 256, 254, 259, 261, 257, 266, 276, 267, 255, 258, 274, 273, 272, 271, 269, 270, 268, 275, 253, 265, 260 )
AND p.shipment_date_dim_id between 20120101 and 20121108
GROUP BY p.division_id;
Runs in 1.6s
Explain yields:
+----+-------------+-------+-------+------------------------------------------------------------------+------------------------+---------+------+---------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+------------------------------------------------------------------+------------------------+---------+------+---------+--------------------------+
| 1 | SIMPLE | p | range | fk_package_details_client_accounts_dim1_idx,shipment_date_dim_id | shipment_date_dim_id | 19 | NULL | 4991810 | Using where; Using index |
+----+-------------+-------+-------+------------------------------------------------------------------+------------------------+---------+------+---------+--------------------------+
MySQL should indeed be looking at the smallest table first, which it is: client_accounts_dim. You’ve given it the client_id index, so it can pull the info for client_id=17 easily.
Then, MySQL needs to take the id and join that over to invoices_fact. You’ve given it the fk_invoice_details_client_accounts_dim1_idx for this task. It all sounds so reasonable!
Now, two questions, one hard and one easy. First:
And the second:
For the first question, I’ve read that InnoDB puts the primary key into all the secondary indexes, but I can’t put my finger on a definitive explanation that says it will use it for your join. I would suggest making this explicit with a composite index on client_accounts_dim covering (client_id, id).
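A minimal sketch of such an index, assuming it should lead with client_id and carry id explicitly. The index name idx_client_id_id is a made-up placeholder, not something from the schema above:

```sql
-- Hypothetical index name. client_id serves the WHERE filter;
-- id is included explicitly so the join value can be read from the index.
ALTER TABLE client_accounts_dim
  ADD INDEX idx_client_id_id (client_id, id);
```

After adding it, re-running EXPLAIN on the first query should show whether the optimizer picks the new key for table c.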
For the second question, once MySQL has found the joined info in the index, it has to go read all the matching rows from disk in order to see which of them are in the division and date range you specified. Another composite index to the rescue, this time on invoices_fact: client_accounts_dim_id first, followed by division_id and shipment_date_dim_id.
NOTE: Put columns 2 and 3 in the correct order, with the lower-cardinality column first.
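A sketch of what this second index could look like, assuming division_id is the lower-cardinality column (not verified against your data) and using a placeholder index name. Putting the date column last also fits the query, since the date filter is a range condition and MySQL generally cannot use key parts that come after a range:

```sql
-- Hypothetical index name. Column 1 serves the join;
-- columns 2 and 3 serve the division and date filters.
-- Assumes division_id has lower cardinality; swap 2 and 3 if your data says otherwise.
ALTER TABLE invoices_fact
  ADD INDEX idx_acct_div_ship (client_accounts_dim_id, division_id, shipment_date_dim_id);
```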
Now, MySQL can just haunt your indexes to gather the complete list of rows!
Aside from the columns discussed above for the joins, it looks like you’re only using one more column: invoices_fact.tracking_number. If you add that to your index, MySQL can get everything that it needs for your query from the index, without ever reading the underlying rows from disk.
NOTE: tracking_number is a wide column, which will bulk up your index, slowing down writes, taking up more disk space, etc. You might test it both ways.
Hope this helps.
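To test the covering variant, something along these lines could work. This is a sketch: the index name is a placeholder, and the column order assumes division_id is the lower-cardinality filter column.

```sql
-- Hypothetical index name. Join column first, then the filter columns,
-- then tracking_number so the query can be answered from the index alone.
ALTER TABLE invoices_fact
  ADD INDEX idx_acct_div_ship_track
    (client_accounts_dim_id, division_id, shipment_date_dim_id, tracking_number);

-- Re-check the plan for the joined query. "Using index" in the Extra column
-- for table p would indicate that the rows are read from the index only.
EXPLAIN
SELECT COUNT(DISTINCT p.tracking_number) AS val, p.division_id AS division_id
FROM client_accounts_dim c
JOIN invoices_fact p ON c.id = p.client_accounts_dim_id
WHERE c.client_id = 17
  AND p.division_id IN (2, 3, 7)
  AND p.shipment_date_dim_id BETWEEN 20120101 AND 20121108
GROUP BY p.division_id;
```

Comparing the run time and index size with and without tracking_number is the "test it both ways" step.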