I have a table called users with roughly 250,000 records in it. I have another table called staging with around 75,000 records in it. Staging only has one column, msisdn. I want to check to see how many rows in staging are not present in users.
I have the following query, which I have tested on a small data subset, and it seems to work fine:
SELECT
s.*
FROM staging s
LEFT OUTER JOIN users u ON u.msisdn=s.msisdn
WHERE u.msisdn IS NULL
The problem however, is when I try to run this query on the full list of 250k users. It ran for an hour before I stopped it. Is there any way I can optimise this query?
I have started running the query on subsets of the data in staging, but this is horribly manual:
SELECT
s.*
FROM staging s
LEFT OUTER JOIN users u ON u.msisdn=s.msisdn
WHERE u.msisdn IS NULL
LIMIT 0,10000
msisdn is the primary key of the staging table, but it’s not the primary key of the table users. I don’t know if that is significant however.
First, you can see what indices MySQL is using with the EXPLAIN command. Just write
EXPLAINin front of your query, and the results will show what index (if any) it’s using. Presumably if it’s that slow on so (relatively) small a data set as 250,000 records, it’s not exploiting a very effective index, and you’ll be able to see where.It may also help to rewrite the query as a
NOT EXISTSlike so: