I have a Hive query which is selecting about 30 columns and around 400,000 records and inserting them into another table. I have one join in my SQL clause, which is just an inner join.
The query fails with a Java "GC overhead limit exceeded" error.
What’s strange is that if I remove the join clause and just select the data from the table (slightly higher volume) then the query works fine.
I’m pretty new to Hive. I can’t understand why this join is causing memory exceptions.
Is there something I should be aware of when writing Hive queries so that they don’t cause these issues? Could anyone explain why the join might cause this error when selecting a higher volume of data with the same number of columns does not?
Appreciate your thoughts on this.
Thanks
Many thanks for the response Mark. Much appreciated.
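After many hours I eventually found out that the order of the tables in the join statement makes a difference. For best performance and memory use, the largest table should be the last one in the join: Hive buffers rows from the earlier tables in memory and streams the last table through the reducers.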
Changing the order of my tables in the join statement fixed the issue.
See Largest Table Last at http://hive.apache.org/docs/r0.9.0/language_manual/joins.html
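As a rough sketch of what that reordering looks like: the table and column names below (small_lookup, big_events, target_table and their columns) are hypothetical placeholders, not the actual schema from the failing query. The STREAMTABLE hint in the last statement is described on the same manual page as a way to pick the streamed table without reordering the join.

```sql
-- Hypothetical tables: small_lookup is small, big_events is the largest table.

-- Before: the largest table comes first, so its rows are among those
-- buffered in reducer memory for each join key.
INSERT OVERWRITE TABLE target_table
SELECT e.event_id, e.col_a, l.label
FROM big_events e
JOIN small_lookup l ON (e.lookup_id = l.id);

-- After: the largest table comes last, so it is streamed through the
-- reducers and only the smaller table's rows are buffered.
INSERT OVERWRITE TABLE target_table
SELECT e.event_id, e.col_a, l.label
FROM small_lookup l
JOIN big_events e ON (e.lookup_id = l.id);

-- Alternative: keep the original order and name the table to stream.
INSERT OVERWRITE TABLE target_table
SELECT /*+ STREAMTABLE(e) */ e.event_id, e.col_a, l.label
FROM big_events e
JOIN small_lookup l ON (e.lookup_id = l.id);
```

All three statements return the same rows for an inner join. Only the choice of which table is buffered and which is streamed changes.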
Your explanation above is very useful as well. Many Thanks