Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8534541
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 11, 20262026-06-11T10:13:13+00:00 2026-06-11T10:13:13+00:00

I am reading a big file (more than a billion records) and joining it

  • 0

I am reading a big file (more than a billion records) and joining it with three other files, I was wondering if there is anyway the process can be made more efficient to avoid multiple reads on the big table.The smalltables may not fit in memory.

A = join smalltable1 by  (f1,f2) RIGHT OUTER,massive by (f1,f2) ;
B = join smalltable2 by  (f3) RIGHT OUTER, A by (f3) ;
C = join smalltable3 by  (f4) ,B by (f4) ;

The alternative that I was thinking is to write a udf and replace values in one read, but I am not sure if a udf would be efficient since the small files won’t fit in the memory. The implementation could be like:

A = LOAD massive 
B = generate f1,udfToTranslateF1(f1),f2,udfToTranslateF2(f2),f3,udfToTranslateF3(f3)

Appreciate your thoughts…

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-11T10:13:14+00:00Added an answer on June 11, 2026 at 10:13 am

    Pig 0.10 introduced integration with Bloom Filters http://search-hadoop.com/c/Pig:/src/org/apache/pig/builtin/Bloom.java%7C%7C+%2522done+%2522exec+Tuple%2522

    You can train a bloom filter on the 3 smaller files and filter big file, hopefully it will result in a smaller file. After that perform standard joins to get 100% precision.

    UPDATE 1
    You would actually need to train 2 Bloom Filters, one for each of the small tables, as you join on different keys.

    UPDATE 2
    It was mentioned in the comments that the outer join is used for augmenting data.
    In this case Bloom Filters might not be the best thing, they are good for filtering and not adding data in outer joins, as you want to keep the non matched data. A better approach would be to partition all small tables on respective fields (f1, f2, f3, f4), store each partition into a separate file small enough to load into memory. Than Group BY massive table on f1, f2, f3, f4 and in a FOREACH pass the group (f1, f2, f3, f4) with associated bag to the custom function written in Java, that loads the respective partitions of the small files into RAM and performs augmentation.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have a big file transfer (say 4gb or so) and rather than using
I have a very big array stored in memory after reading a whole file
I'm reading a big file using fread. When I interrupt the program during it
I am reading a big file in a Java program, using http access. I
this is somewhat related to my earlier query.. Reading A Big File With Python
I need to read a big file by reading at most N lines at
I am sorting big file with by reading into chunks (Arraylist), sorting each arraylist
Lately I have been reading a lot of blog topics about big sites(facebook, twitter,
I have a Boost.MultiIndex big array about 10Gb. In order to reduce the reading
Reading the MCIMX50 Application Processor Reference Manuals on page 1368 (Section 33.3) there is

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.