Edit 1 (clarification): Thank you for the answers so far! The response is gratifying.
I want to clarify the question a little because based on the answers I think I did not describe one aspect of the problem correctly (and I’m sure that’s my fault as I was having a difficult time defining it even for myself).
Here’s the rub: The result set should contain ONLY the records with tstamp BETWEEN ‘2010-01-03’ AND ‘2010-01-09’, AND the one record where the tstamp IS NULL for each order_num in the first set (there will always be one with null tstamp for each order_num).
The answers given so far appear to include all records for a certain order_num if there are any with tstamp BETWEEN ‘2010-01-03’ AND ‘2010-01-09’. For example, if there were another record with order_num = 2 and tstamp = 2010-01-12 00:00:00 it should not be included in the result.
Original question:
Consider an orders table containing id (unique), order_num, tstamp (a timestamp), and item_id (the single item included in an order). tstamp is null, unless the order has been modified, in which case there is another record with identical order_num and tstamp then contains the timestamp of when the change occurred.
Example…
id order_num tstamp item_id __ _________ ___________________ _______ 0 1 100 1 2 101 2 2 2010-01-05 12:34:56 102 3 3 113 4 4 124 5 5 135 6 5 2010-01-07 01:23:45 136 7 5 2010-01-07 02:46:00 137 8 6 100 9 6 2010-01-13 08:33:55 105
What is the most efficient SQL statement to retrieve all of the orders (based on order_num) which have been modified one or more times during a certain date range? In other words, for each order we need all of the records with the same order_num (including the one with NULL tstamp), for each order_num WHERE at least one of the order_num’s has tstamp NOT NULL AND tstamp BETWEEN ‘2010-01-03’ AND ‘2010-01-09’. It’s the “WHERE at least one of the order_num’s has tstamp NOT NULL” that I’m having difficulty with.
The result set should look like this:
id order_num tstamp item_id __ _________ ___________________ _______ 1 2 101 2 2 2010-01-05 12:34:56 102 5 5 135 6 5 2010-01-07 01:23:45 136 7 5 2010-01-07 02:46:00 137
The SQL that I came up with is this, which is essentially “A UNION (B in A)”, but it executes slowly and I hope there is a more efficient solution:
SELECT history_orders.order_id, history_orders.tstamp, history_orders.item_id
FROM
(SELECT orders.order_id, orders.tstamp, orders.item_id
FROM orders
WHERE orders.tstamp BETWEEN '2010-01-03' AND '2010-01-09')
AS history_orders
UNION
SELECT current_orders.order_id, current_orders.tstamp, current_orders.item_id
FROM
(SELECT orders.order_id, orders.tstamp, orders.item_id
FROM orders
WHERE orders.tstamp IS NULL)
AS current_orders
WHERE current_orders.order_id IN
(SELECT orders.order_id
FROM orders
WHERE orders.tstamp BETWEEN '2010-01-03' AND '2010-01-09');
Thank you again for all the suggestions. I found three solutions that work, including my original. At the end I’ve added some performance results, which are not as great as I had hoped. If anyone can improve on this I would be thrilled!
1) The best solution found so far seems to be:
SELECT history_orders.order_id, history_orders.tstamp, history_orders.item_id FROM (SELECT orders.order_id, orders.tstamp, orders.item_id FROM orders WHERE orders.tstamp BETWEEN '2010-01-03' AND '2010-01-09' OR orders.tstamp IS NULL) AS history_orders WHERE history_orders.order_id IN (SELECT orders.order_id FROM orders WHERE orders.tstamp BETWEEN '2010-01-03' AND '2010-01-09');2) I also tried using EXISTS in place of IN, which requires an additional WHERE clause in the last SELECT:
SELECT history_orders.order_id, history_orders.tstamp, history_orders.item_id FROM (SELECT orders.order_id, orders.tstamp, orders.item_id FROM orders WHERE orders.tstamp BETWEEN '2010-01-03' AND '2010-01-09' OR orders.tstamp IS NULL) AS history_orders WHERE EXISTS (SELECT orders.order_id FROM orders WHERE history_orders.order_id = orders.order_id AND orders.tstamp BETWEEN '2010-01-03' AND '2010-01-09');3) And finally there is my original solution, using UNION.
Comments:
To comment on the table size, my actual “real world” problem involves 4 tables (connected with inner joins) containing 98, 2189, 43897, 785656 records respectively.
Performance – I ran each solution three times and here are my real world results:
1: 52, 51, 51 seconds
2: 54, 54, 53 s
3: 56, 56, 56 s