I am running a Pig script on a file of about 1.22 GB. The default HDFS block (chunk) size is 64 MB, and I have 4 DataNodes. The script produces 19 output files.
When I run the script and check the JobTracker, I see 6 jobs:
| Jobid | Priority | User | Name | Map % Complete | Map Total | Maps Completed | Reduce % Complete | Reduce Total | Reduces Completed | Job Scheduling Information | Diagnostic Info |
|---|---|---|---|---|---|---|---|---|---|---|---|
| job_201207121202_0001 | NORMAL | user | PigLatin:Analysis.pig | 100.00% | 20 | 20 | 100.00% | 1 | 1 | NA | NA |
| job_201207121202_0002 | NORMAL | user | PigLatin:Analysis.pig | 100.00% | 5 | 5 | 100.00% | 1 | 1 | NA | NA |
| job_201207121202_0003 | NORMAL | user | PigLatin:Analysis.pig | 100.00% | 2 | 2 | 100.00% | 1 | 1 | NA | NA |
| job_201207121202_0004 | NORMAL | user | PigLatin:Analysis.pig | 100.00% | 2 | 2 | 100.00% | 1 | 1 | NA | NA |
| job_201207121202_0005 | NORMAL | user | PigLatin:Analysis.pig | 100.00% | 5 | 5 | 100.00% | 1 | 1 | NA | NA |
| job_201207121202_0006 | NORMAL | user | PigLatin:Analysis.pig | 100.00% | 5 | 5 | 100.00% | 1 | 1 | NA | NA |
As I understand it, since the input file is 1.22 GB and the block size is 64 MB, the file is stored as 20 blocks. The replication factor is 3. Since there is one map task per input split, I expect 20 map tasks, and job_201207121202_0001 in the list above shows exactly that. However, why am I seeing 5 other jobs, with 19 more map tasks in total?
Can anyone please help me understand this? I thought there would be just one job with 20 map tasks and 1 reduce task, since 1.22 GB / 64 MB ≈ 20.
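The arithmetic behind the expected 20 map tasks can be checked with a minimal Python sketch. It assumes one input split per HDFS block and ignores replication, which adds copies of each block but not extra splits. The function name `estimate_map_tasks` is hypothetical and only for illustration; it is not part of Pig or Hadoop.

```python
import math


def estimate_map_tasks(file_size_mb: float, split_size_mb: float = 64.0) -> int:
    """Rough estimate: one map task per input split, one split per block.

    Assumption: the split size equals the HDFS block size. Replication is
    ignored because replicas do not create additional splits.
    """
    return math.ceil(file_size_mb / split_size_mb)


# 1.22 GB expressed in MB (1 GB = 1024 MB)
file_size_mb = 1.22 * 1024

# 1249.28 MB / 64 MB = 19.52, which rounds up to 20 splits
print(estimate_map_tasks(file_size_mb))
```

This only covers the first job, which reads the original input file; it says nothing about how many jobs the script compiles into.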
I am a Pig/Hadoop beginner. Any help is really appreciated.
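Pig compiles a script into multiple MapReduce (MR) jobs, depending on the semantics of the script. Roughly speaking, a join is one MR job, a group is one MR job, and an order is two MR jobs (the first samples the key distribution). A few other operators also introduce MR boundaries. Each job after the first typically reads the intermediate output of an earlier job rather than the original input file, so its number of map tasks depends on the size of that intermediate data, not on the 1.22 GB input.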
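The rule of thumb above can be turned into a rough estimator. This is only a sketch of that heuristic, not how Pig's compiler actually works: Pig may merge or split jobs in ways this ignores, and the authoritative plan for a script comes from Pig's `EXPLAIN` command. The function name, the cost table, and the operator list are all hypothetical; the list is not taken from Analysis.pig.

```python
# Approximate MR jobs added per operator, per the rule of thumb in the answer.
# Assumption: operators not listed here (LOAD, FILTER, FOREACH, STORE, ...)
# are treated as adding no job boundary.
MR_JOBS_PER_OPERATOR = {
    "JOIN": 1,
    "GROUP": 1,
    "ORDER": 2,  # one job samples the key distribution, one does the sort
}


def estimate_mr_jobs(operators: list[str]) -> int:
    """Estimate how many MR jobs a sequence of Pig operators compiles into."""
    total = sum(MR_JOBS_PER_OPERATOR.get(op.upper(), 0) for op in operators)
    # A script with no boundary operators still runs as one (map-only) job.
    return max(1, total)


# Hypothetical pipeline, for illustration only
pipeline = ["LOAD", "FILTER", "GROUP", "FOREACH", "JOIN", "ORDER", "STORE"]
print(estimate_mr_jobs(pipeline))  # GROUP (1) + JOIN (1) + ORDER (2) = 4
```

A script with a handful of joins, groups, and orders can therefore reach six jobs easily, which is consistent with what the JobTracker shows.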