Here is a simplified version of my problem. I have two tables. Each table has a unique ID field, but it’s irrelevant in this case.
shipments has 3 fields: shipment_id, receive_by_datetime, and qty.
deliveries has 4 fields: delivery_id, shipment_id, delivered_on_datetime, and qty.
In shipments, the shipment_id and receive_by_datetime fields always match up. Many rows in the table would appear to be duplicates based on those two columns, but they aren’t: other fields are different.
In deliveries, the shipment_id matches up to the shipments table. Many rows there would also appear to be duplicates based on the delivery_id and delivered_on_datetime fields, but again they aren’t: other fields exist that I didn’t list.
I am trying to pull one row per combination of delivered_on_datetime and receive_by_datetime, with the delivered qty summed, but the many-to-many relationship makes this difficult. Is a query somewhere along these lines correct?
SELECT d.delivered_on_datetime,
       s.receive_by_datetime,
       SUM(d.qty)
FROM deliveries d
LEFT JOIN (
    SELECT DISTINCT s1.shipment_id, s1.receive_by_datetime
    FROM shipments s1
) s ON (s.shipment_id = d.shipment_id)
GROUP BY d.delivered_on_datetime, s.receive_by_datetime
Yep, the problem with a many-to-many join is that every row pairs with every matching row on the other side, so within each shipment_id you get the Cartesian product of rows. You end up counting the same row more than once: once for each row it matches against.
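To make the double counting concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layouts follow the question, but the rows and dates are invented purely for illustration, and the join is deliberately the naive one without the DISTINCT subquery.

```python
import sqlite3

# In-memory database with made-up sample rows (assumption: not real data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shipments (shipment_id INTEGER, receive_by_datetime TEXT, qty INTEGER);
CREATE TABLE deliveries (delivery_id INTEGER, shipment_id INTEGER,
                         delivered_on_datetime TEXT, qty INTEGER);
-- Two shipment rows share the same shipment_id and date.
INSERT INTO shipments VALUES (1, '2024-01-10', 5), (1, '2024-01-10', 7);
-- Two delivery rows point at that shipment_id.
INSERT INTO deliveries VALUES (100, 1, '2024-01-09', 4), (101, 1, '2024-01-09', 6);
""")

# Naive join: each delivery row pairs with both shipment rows.
naive_total = conn.execute("""
    SELECT SUM(d.qty)
    FROM deliveries d
    LEFT JOIN shipments s ON s.shipment_id = d.shipment_id
""").fetchone()[0]

# Reference value: the delivered quantity with no join at all.
true_total = conn.execute("SELECT SUM(qty) FROM deliveries").fetchone()[0]

print(naive_total, true_total)
```

With these sample rows the joined sum comes out at twice the real delivered quantity, because each delivery row appears once per matching shipment row.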
If "always match up" means there cannot be two shipments with the same ID but different dates, then your query will work. In general, though, it is not safe: if the DISTINCT subselect can return more than one row per shipment ID, you are subject to the double-counting issue again. In the general case this is a very tricky problem to solve. In fact, I see no way to solve it with this data model.
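One way to find out which case applies is to look for shipment IDs that carry more than one distinct receive_by_datetime. The sketch below, again using Python's sqlite3 with invented sample rows, runs such a check and then runs the query from the question against the same data. The names CHECK_SQL and QUERY_SQL are just labels chosen for this example.

```python
import sqlite3

# Lists shipment IDs that map to more than one receive_by_datetime.
CHECK_SQL = """
SELECT shipment_id, COUNT(DISTINCT receive_by_datetime) AS n_dates
FROM shipments
GROUP BY shipment_id
HAVING COUNT(DISTINCT receive_by_datetime) > 1
"""

# The query from the question, unchanged apart from layout.
QUERY_SQL = """
SELECT d.delivered_on_datetime, s.receive_by_datetime, SUM(d.qty)
FROM deliveries d
LEFT JOIN (
    SELECT DISTINCT s1.shipment_id, s1.receive_by_datetime
    FROM shipments s1
) s ON (s.shipment_id = d.shipment_id)
GROUP BY d.delivered_on_datetime, s.receive_by_datetime
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shipments (shipment_id INTEGER, receive_by_datetime TEXT, qty INTEGER);
CREATE TABLE deliveries (delivery_id INTEGER, shipment_id INTEGER,
                         delivered_on_datetime TEXT, qty INTEGER);
-- Shipment 1 is consistent: one date across its rows (sample data).
INSERT INTO shipments VALUES (1, '2024-01-10', 5), (1, '2024-01-10', 7);
-- Shipment 2 is not: the same ID appears with two different dates.
INSERT INTO shipments VALUES (2, '2024-01-11', 3), (2, '2024-01-12', 3);
INSERT INTO deliveries VALUES (100, 1, '2024-01-09', 4), (101, 2, '2024-01-09', 6);
""")

ambiguous = conn.execute(CHECK_SQL).fetchall()
rows = conn.execute(QUERY_SQL).fetchall()

reported_total = sum(r[2] for r in rows)
true_total = conn.execute("SELECT SUM(qty) FROM deliveries").fetchone()[0]

print(ambiguous)
print(reported_total, true_total)
```

If the check returns no rows, the DISTINCT subquery yields exactly one row per shipment ID and the sums are not inflated. In this sample, shipment 2 is flagged, and its delivery quantity shows up under both dates, so the per-group sums add up to more than was actually delivered.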