I’ve got this rather insane query for finding all but the FIRST record with a duplicate value. It takes a substantially long time to run on 38000 records; about 50 seconds.
UPDATE exr_exrresv
SET mh_duplicate = 1
WHERE exr_exrresv._id IN
(
SELECT F._id
FROM exr_exrresv AS F
WHERE Exists
(
SELECT PHONE_NUMBER,
Count(_id)
FROM exr_exrresv
WHERE exr_exrresv.PHONE_NUMBER = F.PHONE_NUMBER
AND exr_exrresv.PHONE_NUMBER != ''
AND mh_active = 1 AND mh_duplicate = 0
GROUP BY exr_exrresv.PHONE_NUMBER
HAVING Count(exr_exrresv._id) > 1)
)
AND exr_exrresv._id NOT IN
(
SELECT Min(_id)
FROM exr_exrresv AS F
WHERE Exists
(
SELECT PHONE_NUMBER,
Count(_id)
FROM exr_exrresv
WHERE exr_exrresv.PHONE_NUMBER = F.PHONE_NUMBER
AND exr_exrresv.PHONE_NUMBER != ''
AND mh_active = 1
AND mh_duplicate = 0
GROUP BY exr_exrresv.PHONE_NUMBER
HAVING Count(exr_exrresv._id) > 1
)
GROUP BY PHONE_NUMBER
);
Any tips on how to optimize it or how I should begin to go about it? I’ve checked out the query plan but I’m really not sure how to begin improving it. Temp tables? Better query?
Here is the explain query plan output:
0|0|0|SEARCH TABLE exr_exrresv USING INTEGER PRIMARY KEY (rowid=?) (~12 rows)
0|0|0|EXECUTE LIST SUBQUERY 0
0|0|0|SCAN TABLE exr_exrresv AS F (~500000 rows)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE exr_exrresv USING AUTOMATIC COVERING INDEX (PHONE_NUMBER=? AND mh_active=? AND mh_duplicate=?) (~7 rows)
1|0|0|USE TEMP B-TREE FOR GROUP BY
0|0|0|EXECUTE LIST SUBQUERY 2
2|0|0|SCAN TABLE exr_exrresv AS F (~500000 rows)
2|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 3
3|0|0|SEARCH TABLE exr_exrresv USING AUTOMATIC COVERING INDEX (PHONE_NUMBER=? AND mh_active=? AND mh_duplicate=?) (~7 rows)
3|0|0|USE TEMP B-TREE FOR GROUP BY
2|0|0|USE TEMP B-TREE FOR GROUP BY
Any tips would be much appreciated. 🙂
Also, I am using Ruby to make the sql query so if it makes more sense for the logic to leave SQL and be written in Ruby, that’s possible.
The schema is as follows, and you can use sqlfiddle here: http://sqlfiddle.com/#!2/2c07e
_id INTEGER PRIMARY KEY
OPPORTUNITY_ID varchar(50)
CREATEDDATE varchar(50)
FIRSTNAME varchar(50)
LASTNAME varchar(50)
MAILINGSTREET varchar(50)
MAILINGCITY varchar(50)
MAILINGSTATE varchar(50)
MAILINGZIPPOSTALCODE varchar(50)
EMAIL varchar(50)
CONTACT_PHONE varchar(50)
PHONE_NUMBER varchar(50)
CallFromWeb varchar(50)
OPPORTUNITY_ORIGIN varchar(50)
PROJECTED_LTV varchar(50)
MOVE_IN_DATE varchar(50)
mh_processed_date varchar(50)
mh_control INTEGER
mh_active INTEGER
mh_duplicate INTEGER
Guessing from your post, it looks like you are trying to update the
mh_duplicatecolumn for any record that has the same phone number if it’s not the first record with that phone number?If that’s correct, I think this should get you the id’s to update (you may need to add back your appropriate where criteria) — from there, the Update is straight-forward:
And the SQL Fiddle.
Good luck.