Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8161511
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 6, 20262026-06-06T18:30:53+00:00 2026-06-06T18:30:53+00:00

I have a skewed data set and I need to do a group by

  • 0

I have a skewed data set and I need to do a group by operation and then do a nested foreach on it. Because of the skewed data, few reducers are taking long time and others are taking no time. I know there exists skewed join but what is there for group by and foreach? Here is my pig code (renamed the variables):

foo_grouped = GROUP foo_grouped by FOO;
FOO_stats = FOREACH foo_grouped 
{ 
a_FOO_total = foo_grouped.ATTR; 
a_FOO_total = DISTINCT a_FOO_total; 

bar_count = foo_grouped.BAR; 
bar_count = DISTINCT bar_count; 

a_FOO_type1 = FILTER foo_grouped by COND1=='Y';
a_FOO_type1 = a_FOO_type1.ATTR; 
a_FOO_type1 = DISTINCT a_FOO_type1;

a_FOO_type2 = FILTER foo_grouped by COND2=='Y' OR COND3=='HIGH'; 
a_FOO_type2 = a_FOO_type2.ATTR; 
a_FOO_type2 = DISTINCT a_FOO_type2; 

generate group as FOO, 
COUNT(a_FOO_total) as a_FOO_total, COUNT(a_FOO_type1) as a_FOO_type1, COUNT(a_FOO_type2)     as a_FOO_type2, COUNT(bar_count) as bar_count; }
  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-06T18:30:54+00:00Added an answer on June 6, 2026 at 6:30 pm

    In your example there are a lot of nested DISTINCT operators within FOREACH which are executed in the reducer, it relies on RAM to calculate unique values and this query produces just one Job. In case of just too many unique elements in a group you could get memory related exceptions as well.

    Luckily PIG Latin is a dataflow language and you write sort of execution plan. In order to utilize more CPUs you could change your code in such way that forces more MapReduce jobs which could be executed in parallel. For that we should rewrite query without using nested DISTINCT, the trick is to do distinct operations and than group by as if you had just one column and than merge the results. It is very SQL like, but it works. Here it is:

    records = LOAD '....' USING PigStorage(',') AS (g, a, b, c, d, fd, s, w);
    selected = FOREACH records GENERATE g, a, b, c, d;
    grouped_a = FOREACH selected GENERATE g, a;
    grouped_a = DISTINCT grouped_a;
    grouped_a_count = GROUP grouped_a BY g;
    grouped_a_count = FOREACH grouped_a_count GENERATE FLATTEN(group) as g, COUNT(grouped_a) as a_count;
    
    grouped_b = FOREACH selected GENERATE g, b;
    grouped_b = DISTINCT grouped_b;
    grouped_b_count = GROUP grouped_b BY g;
    grouped_b_count = FOREACH grouped_b_count GENERATE FLATTEN(group) as g, COUNT(grouped_b) as b_count;
    
    grouped_c = FOREACH selected GENERATE g, c;
    grouped_c = DISTINCT grouped_c;
    grouped_c_count = GROUP grouped_c BY g;
    grouped_c_count = FOREACH grouped_c_count GENERATE FLATTEN(group) as g, COUNT(grouped_c) as c_count;
    
    grouped_d = FOREACH selected GENERATE g, d;
    grouped_d = DISTINCT grouped_d;
    grouped_d_count = GROUP grouped_d BY g;
    grouped_d_count = FOREACH grouped_d_count GENERATE FLATTEN(group) as g, COUNT(grouped_d) as d_count;
    
    mrg = JOIN grouped_a_count BY g, grouped_b_count BY g, grouped_c_count BY g, grouped_d_count BY g;
    out = FOREACH mrg GENERATE grouped_a_count::g, grouped_a_count::a_count, grouped_b_count::b_count, grouped_c_count::c_count, grouped_d_count::d_count;
    STORE out into '....' USING PigStorage(',');
    

    After execution I got following summary which shows that distinct operations did not suffer from the skew in data were processed by the first Job:

    Job Stats (time in seconds):
          JobId            Maps    Reduces MaxMapTime      MinMapTIme      AvgMapTime      MaxReduceTime   MinReduceTime   AvgReduceTime   Alias   Feature Outputs
    job_201206061712_0244   669     45      75      8       13      376     18      202     grouped_a,grouped_b,grouped_c,grouped_d,records,selected        DISTINCT,MULTI_QUERY
    job_201206061712_0245   1       1       3       3       3       12      12      12      grouped_c_count GROUP_BY,COMBINER
    job_201206061712_0246   1       1       3       3       3       12      12      12      grouped_b_count GROUP_BY,COMBINER
    job_201206061712_0247   5       1       48      27      33      30      30      30      grouped_a_count GROUP_BY,COMBINER
    job_201206061712_0248   1       1       3       3       3       12      12      12      grouped_d_count GROUP_BY,COMBINER
    job_201206061712_0249   4       1       3       3       3       12      12      12      mrg,out HASH_JOIN       ...,
    Input(s):
    Successfully read 52215768 records (44863559501 bytes) from: "...."
    
    Output(s):
    Successfully stored 9 records (181 bytes) in: "..."
    

    From Job DAG we can see that groupby operations were executed in parallel:

    Job DAG:
    job_201206061712_0244   ->      job_201206061712_0248,job_201206061712_0246,job_201206061712_0247,job_201206061712_0245,
    job_201206061712_0248   ->      job_201206061712_0249,
    job_201206061712_0246   ->      job_201206061712_0249,
    job_201206061712_0247   ->      job_201206061712_0249,
    job_201206061712_0245   ->      job_201206061712_0249,
    job_201206061712_0249
    

    It works fine on my datasets where one of the group key values (in column g) makes 95% of the data. It also gets rid of memory related exceptions.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have a bunch of sets of data (between 50 to 500 points, each
I have a large collection of scanned images, and they are all somewhat skewed,
I have a simple data logging table with two data columns A and B
I need to have a <textarea> take up all available space inside of a
I have made a fiddle: http://jsfiddle.net/89x4d/ I'm trying to maintain the skewed div but
have written this little class, which generates a UUID every time an object of
Have a procedure which looks like Procedure TestProc(TVar1, TVar2 : variant); Begin TVar1 :=
Have done quite a bit of searching for a guide (of any substance) for
Have deployed numerous report parts which reference the same view however one of them
have a problem. At first look at this HTML <div id=map style=background-image: url(map.png); width:

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.