I have 4 very large tables. Let me call them X, A, B and C.
I want to create two more tables X1 and X2 from X as follows:
Consider a record r in table X. If r has a corresponding record in at least one of the tables A, B and C, I put it in X1. Else I put it in X2.
(How do I decide that r has a corresponding record in A, B or C? I compare a few fields of r with a few fields of a record in A, B or C. The fields may be different for A, B or C and there may be more than one criterion to match r with a record in A, B or C. Probably this part is not that relevant to the main problem.)
I have both options: I can have X, A, B and C as either Oracle tables or SAS datasets.
What is the most efficient way of solving this problem?
Tartaglia’s answer is fairly close, but it’s probably easier to do it in one step.
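A one-step version might look roughly like the sketch below. It assumes a single shared key variable called id (a hypothetical name; substitute your real matching fields) and that id is unique within each of a, b and c, since duplicate keys there would duplicate rows from x in the output.

```sas
/* Sketch only: id is a hypothetical key shared by all four tables. */
proc sort data=x; by id; run;
proc sort data=a; by id; run;
proc sort data=b; by id; run;
proc sort data=c; by id; run;

data x1 x2;
  /* keep=id reads only the key from a, b and c, so nothing in x is overwritten */
  merge x(in=x)
        a(in=in_a keep=id)
        b(in=in_b keep=id)
        c(in=in_c keep=id);
  by id;
  if x;                          /* discard ids that occur only in a, b or c */
  found = in_a or in_b or in_c;  /* 1 if any lookup table has this id */
  if found then output x1;
  else output x2;
  drop found;                    /* helper flag, not wanted in the output */
run;
```

The in= variables are temporary and never written to x1 or x2; found is a real variable, which is why it is dropped explicitly.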
Make sure ‘found’ and ‘x’ are not already variables in any of the original datasets; if they are, use different names.
The only complicating factor is if you want some variables other than ID from a, b, c; if you do, then you need to work out how to ensure you get the right variables if you have a multiple match scenario. This approach also requires sorting all four tables, which may be slow.
Another SAS solution: Hash tables. This does not require sorting your datasets. This is probably faster if your datasets aren’t already in order. However, it does require enough memory to store all of tables a,b, and c in memory, which might be constraining depending on the size of those datasets; and it’s better when a,b,c are small relative to x rather than when they’re of similar sizes. This could be manipulated to yield data from a/b/c rather than just a return code, using defineData, but again you’d have to think about what you want to do if it’s found in two of a,b,c (or all three).
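As a rough sketch of the hash approach, assuming again a single key variable named id (hypothetical) and that the keys of a, b and c all fit in memory:

```sas
/* Sketch only: id is a hypothetical key shared by all four tables. */
data x1 x2;
  set x;
  if _n_ = 1 then do;
    /* Load only the keys of each lookup table into memory, once */
    declare hash ha(dataset: 'a(keep=id)');
    ha.defineKey('id');
    ha.defineDone();
    declare hash hb(dataset: 'b(keep=id)');
    hb.defineKey('id');
    hb.defineDone();
    declare hash hc(dataset: 'c(keep=id)');
    hc.defineKey('id');
    hc.defineDone();
  end;
  /* check() returns 0 when the current value of id is in the hash */
  if ha.check() = 0 or hb.check() = 0 or hc.check() = 0 then output x1;
  else output x2;
run;
```

Because check() only returns a code and does not copy data into the current record, this version sidesteps the question of which table's values to keep when an id appears in more than one of a, b and c.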
To do it in Oracle, the way I think I’d do it is something closer to Tartaglia’s solution: create three ‘match’ tables and then union them (removing duplicates in the union), and then create x2 as x minus x1. (This approach works in PROC SQL in SAS; I am not sure whether Oracle handles except in exactly the same way.)
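In PROC SQL, that idea could be sketched as follows. The key id is a hypothetical stand-in for the real join conditions, and note that both union and except remove duplicate rows, so any fully identical rows in x would be collapsed.

```sas
/* Sketch only: id is a hypothetical key; replace the ON clauses with your real criteria. */
proc sql;
  /* Rows of x that match a, b or c; UNION (without ALL) removes duplicates */
  create table x1 as
    select x.* from x inner join a on x.id = a.id
    union
    select x.* from x inner join b on x.id = b.id
    union
    select x.* from x inner join c on x.id = c.id;

  /* Everything in x that did not make it into x1 */
  create table x2 as
    select * from x
    except
    select * from x1;
quit;
```

On the Oracle side, the set-difference operator has traditionally been spelled MINUS rather than EXCEPT, so the second query would likely need that change there.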
I tested these out using SAS (including the SQL solution, which Oracle may be a bit better at but should be of a similar order; though if your Oracle server is faster than your SAS server, that may change things some).
I used a dataset ‘x’ with 5e7 records and three datasets ‘a’, ‘b’ and ‘c’ with between 1.5e7 and 3e7 records each (specifically, one had all odd numbers, one had multiples of 3, and one had even multiples of 4). They overlap a fair amount: probably 25% or so of records are in 2 or more datasets, and 84% are in one or more. The SQL solution took over 5 minutes to process, while the sort-and-merge solution took around 2.5 minutes to sort and 0.5 minutes to merge, so around 3 minutes total. This comparison may be slightly exaggerated, as the datasets were created sorted, so the sort itself may have been somewhat faster than it would be on unordered data (though SQL would also gain some from the datasets being in order).
This compares to write-out time of about 5 seconds for the 5e7 dataset x.
The hash solution wouldn’t fit into memory on my laptop with the overall ~6e7 record dataset abc, so I shrank them to a total of ~2e7 records (the odds from 1 to 2e7, then multiples of 3 from 2e7 to 4e7, then multiples of 4 from 4e7 to 6e7) but left x with 5e7 records. The hash solution then took 1:41 in total. The sort-and-merge solution took a similar time, most of which was sorting x (about a minute) and merging/writing out the resulting datasets (about half a minute). That was much faster than sorting the larger datasets, as the smaller ones sort in memory while the larger ones couldn’t. The SQL solution took about 4 minutes with those datasets, so it was still substantially slower.