The Archive Base

Editorial Team
Asked: June 16, 2026

I have 4 very large tables. Let me call them X, A, B and C.

I want to create two more tables X1 and X2 from X as follows:

Consider a record r in table X. If r has a corresponding record in at least one of the tables A, B and C, I put it in X1. Else I put it in X2.

(How do I decide that r has a corresponding record in A, B or C? I compare a few fields of r with a few fields of a record in A, B or C. The fields may be different for A, B or C and there may be more than one criterion to match r with a record in A, B or C. Probably this part is not that relevant to the main problem.)

I have both options: X, A, B and C can be either Oracle tables or SAS datasets.

What is the most efficient way of solving this problem?

Regards,

1 Answer

  1. Editorial Team
    Added an answer on June 16, 2026 at 11:57 am

    Tartaglia’s answer is fairly close, but it’s probably easier to do it in one step.

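    data x1 x2;
    /* in=x flags rows coming from x; a, b and c share the flag 'found' */
    merge x(in=x) a(in=found keep=id) b(in=found keep=id) c(in=found keep=id);
    by id;
    /* keep only rows that exist in x; ids present only in a/b/c are dropped */
    if x and found then output x1;
    else if x then output x2;
    run;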
    

    Make sure ‘found’ and ‘x’ are not already variable names in any of the original datasets; if they are, pick different names for the IN= flags.
    The only complicating factor is if you want variables other than ID from a, b, or c: then you need to decide which values to keep when a record matches in more than one of them. This approach also requires all four tables to be sorted by id, which may be slow.
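
The sorting step mentioned above could look roughly like the sketch below. It assumes a single match key called id; the output names a_ids, b_ids and c_ids are hypothetical and not part of the original answer. NODUPKEY is included because, in a match-merge, repeated ids in a lookup table would repeat the matching x rows in the output.

```sas
/* Sketch only: sort x in place by the match key (assumed to be id) */
proc sort data=x;
  by id;
run;

/* Keep just the key from each lookup table, one row per id.
   a_ids, b_ids and c_ids are hypothetical dataset names. */
proc sort data=a(keep=id) out=a_ids nodupkey;
  by id;
run;

proc sort data=b(keep=id) out=b_ids nodupkey;
  by id;
run;

proc sort data=c(keep=id) out=c_ids nodupkey;
  by id;
run;
```

If you sort this way, the MERGE statement would read a_ids, b_ids and c_ids in place of a, b and c.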

    Another SAS solution: hash tables. This does not require sorting your datasets, so it is probably faster if they aren’t already in order. However, it does require enough memory to hold all of tables a, b, and c at once, which might be constraining depending on their size, and it works better when a, b, and c are small relative to x than when they are of similar size. This could be adapted to return data from a/b/c rather than just a return code, using defineData, but again you’d have to think about what you want to do if the record is found in two of a, b, c (or all three).

    data abc/view=abc;
    set a b c;
    keep id;
    run;
    
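    data x1 x2;
    if _n_ = 1 then do;
     /* load the ids from the abc view into an in-memory hash, once */
     declare hash abc(dataset:"abc");
     abc.defineKey("id");
     abc.defineDone();
     call missing(id);
    end;
    set x;
    /* find() returns 0 when the current id is present in the hash */
    rc = abc.find();
    if rc=0 then output x1;
    else output x2;
    /* rc is only a work variable; keep it out of x1 and x2 */
    drop rc;
    run;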
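
If it matters which of a, b, c a record was found in, one possible variation is a separate hash per table with a flag for each. This is a sketch, not something timed or tested in the original answer: the hash names ha, hb, hc and the flag variables in_a, in_b, in_c are hypothetical, and it still assumes a single key called id. It needs at least as much memory as the single-hash version.

```sas
data x1 x2;
  if _n_ = 1 then do;
    /* one key-only hash per lookup table, each loaded once */
    declare hash ha(dataset: "a(keep=id)");
    ha.defineKey("id");
    ha.defineDone();
    declare hash hb(dataset: "b(keep=id)");
    hb.defineKey("id");
    hb.defineDone();
    declare hash hc(dataset: "c(keep=id)");
    hc.defineKey("id");
    hc.defineDone();
  end;
  set x;
  /* check() returns 0 when the key exists; no data is retrieved */
  in_a = (ha.check() = 0);
  in_b = (hb.check() = 0);
  in_c = (hc.check() = 0);
  if in_a or in_b or in_c then output x1;
  else output x2;
run;
```

The flags stay on both output datasets, so you can see afterwards where each match came from; add a drop statement if you don't want them.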
    

    To do it in Oracle, I think I’d do something closer to Tartaglia’s solution: create three ‘match’ tables and union them (the union removes duplicates), then create x2 as x minus x1. For example (this works in PROC SQL in SAS; I’m not sure Oracle is exactly the same for except, since Oracle’s usual operator for this is MINUS):

    create table x1 as
      select x.* from x,a where x.id=a.id
      union
      select x.* from x,b where x.id=b.id
      union
      select x.* from x,c where x.id=c.id
    ;
    create table x2 as
      select * from x except select * from x1;
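
Another way to write the same split in PROC SQL is with EXISTS subqueries. This is a sketch that was not part of the timings reported in this answer, so its speed relative to the union/except version is unknown; it again assumes a single key called id. One difference in behaviour: UNION and EXCEPT collapse fully duplicated rows of x, while this version keeps every row of x exactly once in either x1 or x2.

```sas
proc sql;
  /* rows of x with a match in at least one of a, b, c */
  create table x1 as
    select *
    from x
    where exists (select 1 from a where a.id = x.id)
       or exists (select 1 from b where b.id = x.id)
       or exists (select 1 from c where c.id = x.id);

  /* rows of x with no match in any of a, b, c */
  create table x2 as
    select *
    from x
    where not exists (select 1 from a where a.id = x.id)
      and not exists (select 1 from b where b.id = x.id)
      and not exists (select 1 from c where c.id = x.id);
quit;
```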
    

    I tested these out using SAS, including the SQL solution. Oracle may be somewhat better at the SQL, but the timings should be of a similar order, though if your Oracle server is faster than your SAS server, that may change things somewhat.

    I used a dataset ‘x’ with 5e7 records, and three datasets ‘a’, ‘b’, ‘c’ with between 1.5e7 and 3e7 records each and fair overlap (probably 25% or so of records are in two or more datasets, and 84% are in one or more). Specifically, one had all odd numbers, one had multiples of 3, and one had even multiples of 4. The SQL solution took over 5 minutes to process, while the sort-and-merge solution took around 2.5 minutes to sort and 0.5 minutes to merge, so around 3 minutes in total. These timings may be slightly skewed, as the datasets were created sorted, so the sort itself may have been somewhat faster than it otherwise would be (though SQL would also gain some from the datasets being in order).

    This compares to write-out time of about 5 seconds for the 5e7 dataset x.

    The hash solution wouldn’t fit into memory on my laptop with the overall ~6e7 record dataset abc, so I shrank a, b, and c to a total of ~2e7 records (the odds from 1 to 2e7, then multiples of 3 from 2e7 to 4e7, then multiples of 4 from 4e7 to 6e7) but left x with 5e7 records. The hash solution then took 1:41 in total. The sort-and-merge solution took a similar time, most of which was sorting x (about a minute) and merging and writing out the resulting datasets (about half a minute). That was much faster than sorting the larger datasets, as the smaller ones sort in memory while the larger ones couldn’t. The SQL solution took about 4 minutes with those datasets, so it was still substantially slower.
