Sorry for the length, wanted to give a complete description! I need to show a report displaying some info about an id from another table whenever someone leaves a given country within the last x days. Note how I can have the same country entry in the table multiple times for an id (as the info is queried at regular intervals, but they may not have moved during that time), and can also have different country entries (as they change countries).
Quick explanation of the data:
I have the table below:
CREATE TABLE IF NOT EXISTS `country` (
  `id` mediumint(8) unsigned NOT NULL,
  `timestamp` datetime NOT NULL,
  `country` varchar(64) DEFAULT NULL,
  PRIMARY KEY (`id`,`timestamp`),
  KEY `country` (`country`),
  KEY `timestamp` (`timestamp`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
and the entries are like this:
41352 2012-03-26 15:46:01 Jamaica
41352 2012-03-05 22:49:41 Jamaican Applicant
41352 2012-02-26 15:46:01 Jamaica
41352 2012-02-16 12:11:19 Jamaica
41352 2012-02-05 23:00:30 Jamaican Applicant
This table has about 214,590 rows right now, but will have millions once the test data is replaced with real data.
What I want is some info on everyone who has left x country since y time. Here is how I would like it outputted assuming it was run on the data above:
id name last country TIMESTAMP o_timestamp
41352 Sweet Mercy Jamaica 2012-03-26 15:46:01 2012-03-05 22:49:41
41352 Sweet Mercy Jamaica 2012-02-16 12:11:19 2012-02-05 23:00:30
Where o_timestamp is newer than a certain cutoff (let's say within the last 100 days), country is where they moved to, and the old country (not shown) they came from is whatever I pass into the query (Jamaican Applicant based on the data above).
I developed the following query to satisfy the requirements and was using a certain id to test:
SELECT a.id,
       c.name,
       c.last,
       a.country,
       a.timestamp,
       b.timestamp AS o_timestamp
FROM country a
     INNER JOIN user_info c
             ON ( a.id = c.id )
     LEFT JOIN country AS b
            ON ( a.id = b.id
                 AND a.timestamp != b.timestamp
                 AND a.country != b.country )
WHERE b.timestamp = (SELECT c.timestamp
                     FROM country c
                     WHERE a.id = c.id
                       AND a.timestamp > c.timestamp
                     ORDER BY c.timestamp DESC
                     LIMIT 1)
  AND a.id = 965
I got this to complete quickly (7 total, Query took 0.0050 sec),
and an EXPLAIN EXTENDED revealed the following:
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY c const PRIMARY PRIMARY 3 const 1 100.00
1 PRIMARY a ref PRIMARY PRIMARY 3 const 16 100.00
1 PRIMARY b eq_ref PRIMARY,timestamp PRIMARY 11 const,func 1 100.00 Using where
2 DEPENDENT SUBQUERY c index PRIMARY,timestamp timestamp 8 NULL 1 700.00 Using where; Using index
So I figured I was pretty good and popped in this:
SELECT a.id,
       c.name,
       c.last,
       a.country,
       a.timestamp,
       b.timestamp AS o_timestamp
FROM country a
     INNER JOIN user_info c
             ON ( a.id = c.id )
     LEFT JOIN country AS b
            ON ( a.id = b.id
                 AND a.timestamp != b.timestamp
                 AND a.country != b.country )
WHERE b.timestamp = (SELECT c.timestamp
                     FROM country c
                     WHERE a.id = c.id
                       AND a.timestamp > c.timestamp
                     ORDER BY c.timestamp DESC
                     LIMIT 1)
  AND b.country = "whatever" AND timestamp > DATE_SUB(NOW(), INTERVAL 7 DAY)
This query took an amazing 6 minutes and 54 seconds to complete on a country that had 200 records, and never completed (after going out for the afternoon and night and coming home, so a total of about 8 hours) for a country with 9000 records in the db. In real data, a country could easily be in there 10,000 times; 100k would not be unreasonable.
So I ran EXPLAIN EXTENDED, and got this:
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 3003 100.00
1 PRIMARY c eq_ref PRIMARY PRIMARY 3 b.id 1 100.00
1 PRIMARY a ref PRIMARY PRIMARY 3 b.id 7 100.00 Using where
3 DEPENDENT SUBQUERY c index PRIMARY,timestamp timestamp 8 NULL 1 700.00 Using where; Using index
2 DERIVED country range country,timestamp country 195 NULL 474 100.00 Using where; Using index
So it looks larger, but not unreasonably so.
[Removed config variables for space, let me know if needed, and also the performance info since it's probably a query thing.]
Let me know if I missed anything.
The problem isn’t adding a criterion; it is dropping one that’s doing the damage. In the original query, you had the condition AND a.id = 965.
This means that the query execution does not need to read the entire a (country) table. In your second, performance-killed query, you change that criterion to AND b.country = "whatever" AND timestamp > DATE_SUB(NOW(), INTERVAL 7 DAY).
You no longer have a really restrictive criterion on alias a, so things work much more slowly.
Things get more complex when it is realized that alias b is another reference to country. Nevertheless, the change from a condition on a to a condition on b (where b is on the outer side of an outer join) is not trivial; it takes a lot longer to deal with the query conditions.
With the given query structure, the answer seems to be ‘yes’, but the given query structure may be, shall we say, sub-optimal.
Your ‘fast enough when working on one ID’ query is the first one in the question, the one that ends with AND a.id = 965.
I don’t fully understand this query and what it is attempting to do. You need to be aware that outer joins are more expensive than inner joins, and conditions on the outer-joined table like the correlated sub-query comparison b.timestamp = (SELECT c.timestamp ... ORDER BY c.timestamp DESC LIMIT 1) are fiendishly expensive. One problem is that there might be a NULL in the b columns, including timestamp, but the sub-query is wasted on that because the condition won’t be satisfied unless the values are non-null, so we end up wondering ‘why an OUTER join’?
When you added the revised condition, you should have received an ‘ambiguous column name’ error, since that unqualified timestamp could be from a or b (both are references to country). Also, the b.country = "whatever" condition is another that only makes sense when the b values are not null, so again, the OUTER join is dubious.
As I understand it, the country table contains records about who entered which country and when. Also, FWIW, I’m tolerably certain that the join with the user_info table is a negligible performance issue; the problem is all down to the three references to the country table.
Judging from some of the clarifications, you could build up the query incrementally, maybe something like this.
Find each pair of country records for the same id where the records are adjacent in time sequence, and the older of the pair is for a given country (‘Jamaican Applicant’) and the newer is for a different country.
The easy part of this is:
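A minimal sketch of that self-join, run through Python's built-in sqlite3 purely so it can be tried without a MySQL server. The alias convention (a is the newer row, b the older one) is borrowed from the question, and the SQLite column types are stand-ins for the MySQL ones; the SELECT itself is plain SQL that should carry over, but it has not been run against MySQL.

```python
import sqlite3

# SQLite stand-in for the MySQL table; timestamps are ISO-8601 text,
# which compares correctly as plain strings.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE country (id INTEGER NOT NULL, timestamp TEXT NOT NULL, "
    "country TEXT, PRIMARY KEY (id, timestamp))"
)
conn.executemany(
    "INSERT INTO country VALUES (?, ?, ?)",
    [
        (41352, "2012-03-26 15:46:01", "Jamaica"),
        (41352, "2012-03-05 22:49:41", "Jamaican Applicant"),
        (41352, "2012-02-26 15:46:01", "Jamaica"),
        (41352, "2012-02-16 12:11:19", "Jamaica"),
        (41352, "2012-02-05 23:00:30", "Jamaican Applicant"),
    ],
)

# b is the older row (the country being left); a is ANY later row for the
# same id with a different country. Adjacency is not enforced yet.
EASY_PART = """
SELECT a.id, a.country, a.timestamp, b.timestamp AS o_timestamp
  FROM country AS a
  JOIN country AS b
    ON a.id = b.id
   AND b.timestamp < a.timestamp
   AND a.country <> b.country
 WHERE b.country = ?
"""

# Sorting happens in Python only to make the demo output stable.
pairs = sorted(
    conn.execute(EASY_PART, ("Jamaican Applicant",)).fetchall(),
    key=lambda r: (r[2], r[3]),
    reverse=True,
)
for row in pairs:
    print(row)
```

On the five sample rows from the question this returns more pairs than the two wanted, because every later ‘Jamaica’ row is paired with every earlier ‘Jamaican Applicant’ row.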
This does most of the job, but does not ensure adjacency for the entries. To do that, we have to insist that there is no record in the country table for the same id in between (but not including) the two timestamps, a.timestamp and b.timestamp. That’s an extra NOT EXISTS condition.
Note that BETWEEN AND notation is not suitable. It includes the end points in the range, but we explicitly need the end points excluded.
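The adjacency check could be sketched like this, again via Python's sqlite3 as a stand-in for MySQL. The alias m for the ‘row in the middle’ is a name made up for this illustration, and a/b follow the question's convention (a newer, b older). The strict < and > comparisons are what exclude the end points.

```python
import sqlite3

# SQLite stand-in for the MySQL table, loaded with the question's sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE country (id INTEGER NOT NULL, timestamp TEXT NOT NULL, "
    "country TEXT, PRIMARY KEY (id, timestamp))"
)
conn.executemany(
    "INSERT INTO country VALUES (?, ?, ?)",
    [
        (41352, "2012-03-26 15:46:01", "Jamaica"),
        (41352, "2012-03-05 22:49:41", "Jamaican Applicant"),
        (41352, "2012-02-26 15:46:01", "Jamaica"),
        (41352, "2012-02-16 12:11:19", "Jamaica"),
        (41352, "2012-02-05 23:00:30", "Jamaican Applicant"),
    ],
)

# Same self-join as the easy part, plus NOT EXISTS: reject a pair if any
# row for the same id lies strictly between the two timestamps.
ADJACENT = """
SELECT a.id, a.country, a.timestamp, b.timestamp AS o_timestamp
  FROM country AS a
  JOIN country AS b
    ON a.id = b.id
   AND b.timestamp < a.timestamp
   AND a.country <> b.country
 WHERE b.country = ?
   AND NOT EXISTS (SELECT 1
                     FROM country AS m
                    WHERE m.id = a.id
                      AND m.timestamp > b.timestamp
                      AND m.timestamp < a.timestamp)
"""

# Sorting happens in Python only to make the demo output stable.
moves = sorted(
    conn.execute(ADJACENT, ("Jamaican Applicant",)).fetchall(),
    key=lambda r: r[2],
    reverse=True,
)
for row in moves:
    print(row)
```

With the sample data, only the two pairs shown in the question's desired output survive the NOT EXISTS filter.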
Given the list of country entries above, we now need to select just those rows where the … hmmm, well, what is the criterion? I think you get to choose, but the result can be joined with the user_info table easily.
I’m not about to guarantee that the performance will be better (or even that it is syntactically correct; it hasn’t been past an SQL DBMS). But I think the complex query structure for getting the adjacent dates is neater and probably better performing than the original code. Note, in particular, that it does not use any outer join, (explicit) ordering or limit clauses. That should help.
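For illustration, the join with user_info might look like the sketch below, once more using Python's sqlite3 in place of MySQL. Several things here are assumptions: the user_info columns (id, name, last) are inferred from the question's SELECT list, the derived-table alias p and the alias m are invented for the example, and the cutoff on b.timestamp is just one possible choice of criterion. In MySQL the cutoff could be an expression such as the DATE_SUB(NOW(), INTERVAL 7 DAY) from the question instead of a bound parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE country (id INTEGER NOT NULL, timestamp TEXT NOT NULL,
                      country TEXT, PRIMARY KEY (id, timestamp));
-- Hypothetical shape for user_info; only id, name and last are used.
CREATE TABLE user_info (id INTEGER PRIMARY KEY, name TEXT, last TEXT);
INSERT INTO user_info VALUES (41352, 'Sweet', 'Mercy');
""")
conn.executemany(
    "INSERT INTO country VALUES (?, ?, ?)",
    [
        (41352, "2012-03-26 15:46:01", "Jamaica"),
        (41352, "2012-03-05 22:49:41", "Jamaican Applicant"),
        (41352, "2012-02-26 15:46:01", "Jamaica"),
        (41352, "2012-02-16 12:11:19", "Jamaica"),
        (41352, "2012-02-05 23:00:30", "Jamaican Applicant"),
    ],
)

# Inner query: adjacent pairs leaving the given country after a cutoff.
# Outer query: decorate each pair with the user's name via an inner join.
REPORT = """
SELECT p.id, u.name, u.last, p.country, p.timestamp, p.o_timestamp
  FROM (SELECT a.id, a.country, a.timestamp, b.timestamp AS o_timestamp
          FROM country AS a
          JOIN country AS b
            ON a.id = b.id
           AND b.timestamp < a.timestamp
           AND a.country <> b.country
         WHERE b.country = ?
           AND b.timestamp > ?
           AND NOT EXISTS (SELECT 1
                             FROM country AS m
                            WHERE m.id = a.id
                              AND m.timestamp > b.timestamp
                              AND m.timestamp < a.timestamp)) AS p
  JOIN user_info AS u
    ON u.id = p.id
"""

report = conn.execute(
    REPORT, ("Jamaican Applicant", "2012-03-01 00:00:00")
).fetchall()
for row in report:
    print(row)
```

With a cutoff of 2012-03-01, only the March departure from ‘Jamaican Applicant’ is reported; moving the cutoff back before 2012-02-05 would bring the February one in as well.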