The Archive Base

Editorial Team
Asked: May 27, 2026

My understanding is that a HASH JOIN only makes sense when one of the two tables is small enough to fit into memory as a hash table.

But when I gave Oracle a query in which both tables have several hundred million rows, it still came up with a hash join in the explain plan. Even when I tried to trick it with OPT_ESTIMATE(rows = ….) hints, it always chose a HASH JOIN instead of a sort merge join.

So I wonder: how is a HASH JOIN possible when both tables are very large?

Thanks,
Yang


1 Answer

  1. Editorial Team
    Added an answer on May 27, 2026 at 3:34 pm

    Hash joins obviously work best when everything fits in memory. But that does not mean they stop being the best join method when the tables can’t fit in memory. I think the only other realistic join method is a sort merge join.

    If the hash table can’t fit in memory, then the sort for a sort merge join can’t fit in memory either, and the merge join needs to sort both tables. In my experience, hashing is always faster than sorting, for joining and for grouping.

    But there are some exceptions. From the Oracle® Database Performance Tuning Guide, The Query Optimizer:

    Hash joins generally perform better than sort merge joins. However,
    sort merge joins can perform better than hash joins if both of the
    following conditions exist:

      The row sources are sorted already.
      A sort operation does not have to be done.
    

    Test

    Instead of creating hundreds of millions of rows, it’s easier to force Oracle to only use a very small amount of memory.

    This chart shows that hash joins outperform merge joins, even when the tables are too large to fit in (artificially limited) memory:

    [Chart: Hash vs Merge]


    Notes

    For performance tuning it’s usually better to use bytes than number of rows. But the "real" size of the table is a difficult thing to measure, which is why the chart displays rows. The sizes go approximately from 0.375 MB up to 14 MB. To double-check that these queries are really writing to disk you can run them with /*+ gather_plan_statistics */ and then query v$sql_plan_statistics_all.
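
    As a rough sketch of that double-check, the statements below run one of the test joins with the hint and then read the work area statistics for each plan step. It assumes the test1 and test2 tables from the scripts below already exist, that the session is allowed to read the v$ views, and that the join was typed exactly as shown so the LIKE pattern can find it. That lookup is only an illustration, not necessarily how the original test was checked.

    ```sql
    --Run the join with runtime plan statistics enabled.
    select /*+ gather_plan_statistics use_hash(test1 test2) */ count(*)
    from test1 join test2 on test1.a = test2.a;

    --Read the per-step statistics for that statement.
    --LAST_EXECUTION is OPTIMAL when the work area fit in memory, and
    --ONE PASS or MULTI-PASS when it spilled to temporary space.
    --LAST_TEMPSEG_SIZE is null when the step did not spill to disk.
    select p.child_number, p.id, p.operation, p.options,
        p.last_execution, p.last_memory_used, p.last_tempseg_size
    from v$sql_plan_statistics_all p
    where p.sql_id in
    (
        --Assumption: the statement text starts exactly like the join above.
        select s.sql_id
        from v$sql s
        where s.sql_text like
            'select /*+ gather_plan_statistics use_hash(test1 test2) */%'
    )
    order by p.child_number, p.id;
    ```

    If the HASH JOIN row shows ONE PASS or MULTI-PASS and a non-null LAST_TEMPSEG_SIZE, the join really did write to disk. The same check works for the merge join by swapping in the use_merge hint in both statements.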

    I only tested hash joins vs merge sort joins. I didn’t fully test nested loops because that join method is always incredibly slow with large amounts of data. As a sanity check, I did compare it once with the last data size, and it took at least several minutes before I killed it.

    I also tested with different _area_sizes, ordered and unordered data, and different distinctness of the join column (more matches is more CPU-bound, fewer matches is more I/O-bound), and got relatively similar results.

    However, the results were different when the amount of memory was ridiculously small. With sort_area_size and hash_area_size set to only 32K, the sort merge join was significantly faster. But if you have so little memory you probably have more significant problems to worry about.

    There are still many other variables to consider, such as parallelism, hardware, bloom filters, etc. People have probably written books on this subject, and I haven’t tested even a small fraction of the possibilities. But hopefully this is enough to confirm the general consensus that hash joins are best for large data.


    Code

    Below are the scripts I used:

    --Drop objects if they already exist
    drop table test_10k_rows purge;
    drop table test1 purge;
    drop table test2 purge;
    
    --Create a small table to hold rows to be added.
    --("connect by" would run out of memory later when _area_sizes are small.)
    --VARIABLE: More or less distinct values can change results.  Changing
    --"level" to something like "mod(level,100)" will result in more joins, which
    --seems to favor hash joins even more.
    create table test_10k_rows(a number, b number, c number, d number, e number);
    insert /*+ append */ into test_10k_rows
        select level a, 12345 b, 12345 c, 12345 d, 12345 e
        from dual connect by level <= 10000;
    commit;
    
    --Restrict memory size to simulate running out of memory.
    alter session set workarea_size_policy=manual;
    
    --1 MB for hashing and sorting
    --VARIABLE: Changing this may change the results.  Setting it very low,
    --such as 32K, will make merge sort joins faster.
    alter session set hash_area_size = 1048576;
    alter session set sort_area_size = 1048576;
    
    --Tables to be joined
    create table test1(a number, b number, c number, d number, e number);
    create table test2(a number, b number, c number, d number, e number);
    
    --Type to hold results
    --(In SQL*Plus, CREATE TYPE must be terminated with a slash.)
    create or replace type number_table is table of number;
    /
    
    set serveroutput on;
    
    --
    --Compare hash and merge joins for different data sizes.
    --
    declare
        v_hash_seconds number_table := number_table();
        v_average_hash_seconds number;
        v_merge_seconds number_table := number_table();
        v_average_merge_seconds number;
    
        v_size_in_mb number;
        v_rows number;
        v_begin_time number;
        v_throwaway number;
    
        --Increase the size of the table this many times
        c_number_of_steps number := 40;
        --Join the tables this many times
        c_number_of_tests number := 5;
    
    begin
        --Clear existing data
        execute immediate 'truncate table test1';
        execute immediate 'truncate table test2';
    
        --Print headings.  Use tabs for easy import into spreadsheet.
        dbms_output.put_line('Rows'||chr(9)||'Size in MB'
            ||chr(9)||'Hash'||chr(9)||'Merge');
    
        --Run the test for many different steps
        for i in 1 .. c_number_of_steps loop
            v_hash_seconds.delete;
            v_merge_seconds.delete;
            --Add about 0.375 MB of data (roughly - depends on lots of factors)
            --The order by will store the data randomly.
            insert /*+ append */ into test1
            select * from test_10k_rows order by dbms_random.value;
    
            insert /*+ append */ into test2
            select * from test_10k_rows order by dbms_random.value;
    
            commit;
    
            --Get the new size
            --(Sizes may not increment uniformly)
            select bytes/1024/1024 into v_size_in_mb
            from user_segments where segment_name = 'TEST1';
    
            --Get the rows.  (select from both tables so they are equally cached)
            select count(*) into v_rows from test1;
            select count(*) into v_rows from test2; 
    
            --Perform the joins several times
            for i in 1 .. c_number_of_tests loop
                --Hash join
                v_begin_time := dbms_utility.get_time;
                select /*+ use_hash(test1 test2) */ count(*) into v_throwaway
                from test1 join test2 on test1.a = test2.a;
                v_hash_seconds.extend;
                v_hash_seconds(i) := (dbms_utility.get_time - v_begin_time) / 100;
    
                --Merge join
                v_begin_time := dbms_utility.get_time;
                select /*+ use_merge(test1 test2) */ count(*) into v_throwaway
                from test1 join test2 on test1.a = test2.a;
                v_merge_seconds.extend;
                v_merge_seconds(i) := (dbms_utility.get_time - v_begin_time) / 100;
            end loop;
    
            --Get average times.  Throw out the highest and lowest result.
            select ( sum(column_value) - max(column_value) - min(column_value) ) 
                / (count(*) - 2)
            into v_average_hash_seconds
            from table(v_hash_seconds);
    
            select ( sum(column_value) - max(column_value) - min(column_value) ) 
                / (count(*) - 2)
            into v_average_merge_seconds
            from table(v_merge_seconds);
    
            --Display size and times
            dbms_output.put_line(v_rows||chr(9)||v_size_in_mb||chr(9)
                ||v_average_hash_seconds||chr(9)||v_average_merge_seconds);
    
        end loop;
    end;
    /
    