I am reading a big file (more than a billion records) and joining it

Asked: June 11, 2026

I am reading a big file (more than a billion records) and joining it with three other files. Is there any way to make the process more efficient and avoid multiple reads of the big table? The small tables may not fit in memory.

A = JOIN smalltable1 BY (f1, f2) RIGHT OUTER, massive BY (f1, f2);
B = JOIN smalltable2 BY f3 RIGHT OUTER, A BY f3;
C = JOIN smalltable3 BY f4, B BY f4;

The alternative I was considering is to write a UDF that replaces the values in a single read, but I am not sure a UDF would be efficient, since the small files won't fit in memory. The implementation could look like:

A = LOAD 'massive' AS (f1, f2, f3, f4);
B = FOREACH A GENERATE f1, udfToTranslateF1(f1), f2, udfToTranslateF2(f2), f3, udfToTranslateF3(f3);

Appreciate your thoughts…

1 Answer

Editorial Team · Answer 1 · 2026-06-11T10:13:14+00:00

Pig 0.10 introduced built-in support for Bloom filters: http://search-hadoop.com/c/Pig:/src/org/apache/pig/builtin/Bloom.java%7C%7C+%2522done+%2522exec+Tuple%2522

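You can train a Bloom filter on the three smaller files and use it to filter the big file, which will hopefully leave a much smaller file. A Bloom filter can return false positives but never false negatives, so afterwards you run the standard joins on the filtered data to get 100% precision.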
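To make the filtering step concrete, here is a minimal, hand-rolled Bloom filter in plain Java. It is only a sketch of the idea and is not Pig's own Bloom UDF. The class name `KeyBloomFilter`, the string keys, and the hashing scheme are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hand-rolled Bloom filter over string keys; an illustration, not Pig's Bloom UDF. */
class KeyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    KeyBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1; // odd stride
        return Math.floorMod(h1 + i * h2, numBits);
    }

    /** Train the filter with one small-table key. */
    void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
    }

    /** false: the key is definitely absent; true: the key may be present. */
    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    /** Keep only the big-table keys that could match a small-table key. */
    static List<String> preFilter(List<String> bigKeys, KeyBloomFilter filter) {
        List<String> kept = new ArrayList<>();
        for (String key : bigKeys) {
            if (filter.mightContain(key)) {
                kept.add(key);
            }
        }
        return kept;
    }
}
```

Every key that was added always passes `mightContain`, so no matching record is lost. Keys that pass by accident are false positives, and the subsequent join discards them.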

UPDATE 1
You would actually need to train a separate Bloom filter for each of the small tables (three here), as you join on different keys.

UPDATE 2
It was mentioned in the comments that the outer join is used for augmenting data.
In that case Bloom filters might not be the best fit: they are good for filtering, not for adding data in outer joins, where you want to keep the unmatched records. A better approach would be to partition each small table on its respective fields (f1, f2, f3, f4) and store each partition in a separate file small enough to load into memory. Then GROUP BY the massive table on f1, f2, f3, f4 and, in a FOREACH, pass the group (f1, f2, f3, f4) with its associated bag to a custom function written in Java that loads the respective partitions of the small files into RAM and performs the augmentation.
