I am tuning a query on SQL Server 2005.
Please note the real question is at the end.
I have the following query. Both pto and ph have about 30 million rows. The query initially ran very slowly (3 minutes), so I added one index on pto and one on ph.
SELECT
MAX(ph.txn_date_time)
FROM
pto AS pto WITH (NOLOCK)
INNER JOIN ph AS ph WITH (NOLOCK) ON ph.receipt_id = pto.receipt_id
WHERE
pto.subtype = 'ff'
AND pto.Units_No > 0
AND ph.branch_id = 5
CREATE NONCLUSTERED INDEX [IX_pto_subTypeUnitReceipt] ON [dbo].[pto]
(
[SUBTYPE] ASC,
[Units_No] ASC,
[RECEIPT_ID] ASC
)WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = ON, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [Indexes]
CREATE NONCLUSTERED INDEX [IX_ph_branchReceiptTxn] ON [dbo].[ph]
(
[BRANCH_ID] ASC,
[RECEIPT_ID] ASC,
[TXN_DATE_TIME] ASC
)WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [Indexes]
Now the query runs in 350 ms. Great. The execution plan is also very simple: it uses the new index on each of the two tables, does a Hash join on the receipt_id column, then a Stream Aggregate to compute MAX(ph.txn_date_time). So every column in the query is covered by the two added indexes.
The question is: why did it use a Hash join on the receipt_id column? Since RECEIPT_ID is sorted in both indexes, I expected the optimizer to use a merge join. To figure out why, I changed the first index as below (putting RECEIPT_ID before Units_No).
CREATE NONCLUSTERED INDEX [IX_pto_subTypeUnitReceipt] ON [dbo].[pto]
(
[SUBTYPE] ASC,
[RECEIPT_ID] ASC,
[Units_No] ASC
)WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = ON, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [Indexes]
And now I see a Merge join on the RECEIPT_ID column, and the query runs in 170 ms. Evidently the optimizer now considers RECEIPT_ID sorted in both indexes, so it uses a merge join. But I don’t understand why it didn’t think so in the first case.
The reason is that RECEIPT_ID isn’t the first sorted item in the indexes you had. You had units_no in the way. Because Units_No > 0 is a range predicate, the matching rows span many Units_No values, and RECEIPT_ID is only sorted within each one of them, not across the whole range.
Imagine you had a row of books ordered by publisher, then by author, then by colour. If you wanted to find all the books of a specific colour, you would need to visit each publisher section, then each author section, and then find the books of the right colour. So that ‘index’ wouldn’t be very appropriate for scanning by colour, even though you could, at a stretch, say the books were sorted by colour.
When you add the last index, RECEIPT_ID is available sorted, because you are limiting the query by SUBTYPE. Therefore all of the RECEIPT_ID values from both sides are simply available, cost is low and a merge join is picked.
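To make the ordering argument concrete, here is a minimal Python sketch (not SQL Server itself, and not how the optimizer is implemented). It treats an index as a list of rows sorted by the key columns, applies the query's predicates on pto, and checks whether the surviving RECEIPT_ID values come out in order. The sample rows and the helper names are made up for illustration.

```python
# Hypothetical sample rows for pto: (subtype, units_no, receipt_id).
ROWS = [
    ("ff", 1, 40),
    ("ff", 1, 90),
    ("ff", 2, 10),
    ("ff", 2, 70),
    ("ff", 0, 55),
    ("gg", 3, 20),
]


def receipt_ids_in_index_order(rows, index_key):
    """Scan rows in index-key order, keep those matching the query's
    predicates (subtype = 'ff' AND units_no > 0), and return their
    receipt_id values in the order the scan encounters them."""
    ordered = sorted(rows, key=index_key)
    return [receipt for (subtype, units, receipt) in ordered
            if subtype == "ff" and units > 0]


def is_sorted(values):
    """True if values are in non-decreasing order."""
    return all(a <= b for a, b in zip(values, values[1:]))


# First index: (SUBTYPE, Units_No, RECEIPT_ID)
first_index = receipt_ids_in_index_order(ROWS, lambda r: (r[0], r[1], r[2]))

# Changed index: (SUBTYPE, RECEIPT_ID, Units_No)
second_index = receipt_ids_in_index_order(ROWS, lambda r: (r[0], r[2], r[1]))

print(first_index, is_sorted(first_index))
print(second_index, is_sorted(second_index))
```

With the first key order, the scan returns RECEIPT_ID values grouped by Units_No, so a merge join would need an extra sort. With the second, the equality on SUBTYPE leaves RECEIPT_ID as the next sort column, and the values arrive already ordered.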