I have a Hive query which is selecting about 30 columns and around 400,000

Question

Asked: June 7, 2026

I have a Hive query that selects about 30 columns and around 400,000 records and inserts them into another table. The query contains a single join, which is just an inner join.

The query fails with a Java "GC overhead limit exceeded" error.

What’s strange is that if I remove the join clause and just select the data from the table (slightly higher volume) then the query works fine.

I’m pretty new to Hive. I can’t understand why this join is causing memory exceptions.

Is there something I should be aware of in how I write Hive queries so that they don't cause these issues? Could anyone explain why the join might cause this problem when selecting a higher volume of data with the same number of columns does not?

Appreciate your thoughts on this.
Thanks

1 Answer

Editorial Team · Answer 1 · 2026-06-07T08:02:15+00:00

Many thanks for the response Mark. Much appreciated.

After many hours I eventually found out that the order of the tables in the join statement makes a difference. For optimum performance and memory management, the largest table should come last in the join, because Hive buffers the rows of the earlier tables in memory and streams the last one.

Changing the order of my tables in the join statement fixed the issue.

See Largest Table Last at http://hive.apache.org/docs/r0.9.0/language_manual/joins.html
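
As a rough sketch of the "largest table last" advice: the tables `small_lookup` (few rows) and `big_events` (many rows), the target `report_out`, and all column names are hypothetical, not taken from the original query.

```sql
-- Hypothetical tables: small_lookup (small), big_events (large).
-- Hive buffers the rows of the earlier tables in the join and streams
-- the last one, so the large table is placed last.
INSERT OVERWRITE TABLE report_out
SELECT e.event_id, e.event_time, l.label
FROM small_lookup l
JOIN big_events e
  ON e.lookup_id = l.lookup_id;

-- Alternative: keep the original table order and name the table
-- to stream explicitly with the STREAMTABLE hint.
INSERT OVERWRITE TABLE report_out
SELECT /*+ STREAMTABLE(e) */ e.event_id, e.event_time, l.label
FROM big_events e
JOIN small_lookup l
  ON e.lookup_id = l.lookup_id;
```

Both statements are meant to produce the same rows; they differ only in which table Hive holds in memory during the join. The `STREAMTABLE` hint is covered in the same joins page of the Hive manual.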

Your explanation above is very useful as well. Many thanks.
