Is there any way for each reducer process to determine the number of elements or records it has to process?
Your reducer class must extend the Hadoop MapReduce Reducer class:
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
and then must override the reduce method using the KEYIN/VALUEIN types specified in the extended Reducer class:
reduce(KEYIN key, Iterable<VALUEIN> values, org.apache.hadoop.mapreduce.Reducer.Context context)
The values associated with a given key can be counted by iterating over values and incrementing a counter.
Though I'd propose doing this counting alongside your other processing, so as not to make two passes through your value set.
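As a rough sketch of counting in the same pass as the rest of the work, the plain-Java helper below walks an Iterable once and keeps a counter next to a stand-in computation (summing string lengths). The class name ValueCounter, the method countAndSumLengths, and the choice of String values are assumptions made for illustration; none of them come from Hadoop.

```java
import java.util.Arrays;
import java.util.List;

public class ValueCounter {
    // Hypothetical helper: one pass over the values, counting them while
    // doing some other per-value work (here, summing string lengths).
    // Returns {count, totalLength}.
    public static long[] countAndSumLengths(Iterable<String> values) {
        long count = 0;
        long totalLength = 0;
        for (String v : values) {
            count++;                    // the count the question asks about
            totalLength += v.length();  // placeholder for the real processing
        }
        return new long[] {count, totalLength};
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("a,b", "c,d,e", "f");
        long[] result = countAndSumLengths(values);
        // prints "count=3 totalLength=9"
        System.out.println("count=" + result[0] + " totalLength=" + result[1]);
    }
}
```

In an actual reducer the loop body would sit inside reduce(...) and iterate over the Iterable<VALUEIN> values argument; Hadoop is left out here only so the sketch runs on its own.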
EDIT
One option is a vector of vectors, which grows dynamically as you add to it (so you won't have to statically declare your arrays, and hence don't need the size of the value set). This works best for non-regular data (i.e., the number of columns is not the same for every row in your input CSV file), but has the most overhead.
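A minimal sketch of such a structure, assuming each value is one comma-separated line of text, could look like the following. The class name RaggedRows and the method parse are hypothetical, and the plain split on "," ignores quoting and drops trailing empty fields.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Vector;

public class RaggedRows {
    // Hypothetical helper: builds one inner Vector per input line.
    // Rows may have different lengths, and the outer Vector grows as
    // rows are added, so the number of lines need not be known up front.
    public static Vector<Vector<String>> parse(Iterable<String> csvLines) {
        Vector<Vector<String>> rows = new Vector<>();
        for (String line : csvLines) {
            rows.add(new Vector<>(Arrays.asList(line.split(","))));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a,b", "c,d,e", "f");
        Vector<Vector<String>> rows = parse(lines);
        // prints "[[a, b], [c, d, e], [f]]"
        System.out.println(rows);
    }
}
```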
Then you can access the Mth column of the Nth row by indexing the outer vector with N and the inner vector with M.
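For illustration, the lookup can be sketched as two chained get calls, assuming zero-based indices and a Vector<Vector<String>> layout. The class name CellAccess and the method cell are hypothetical names.

```java
import java.util.Arrays;
import java.util.Vector;

public class CellAccess {
    // Hypothetical helper: returns column m of row n (both zero-based).
    // Throws ArrayIndexOutOfBoundsException if either index is out of range.
    public static String cell(Vector<Vector<String>> rows, int n, int m) {
        return rows.get(n).get(m);
    }

    public static void main(String[] args) {
        Vector<Vector<String>> rows = new Vector<>();
        rows.add(new Vector<>(Arrays.asList("a", "b")));
        rows.add(new Vector<>(Arrays.asList("c", "d", "e")));
        // prints "e": row 1, column 2
        System.out.println(cell(rows, 1, 2));
    }
}
```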
Now, if you knew the number of columns was fixed, you could modify this to use a Vector of arrays, which would probably be a little faster and more space efficient.
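A sketch of that fixed-width variant, assuming the column count is known in advance and passed in as a parameter, might look like this. The class name FixedWidthRows, the method parse, and the choice to reject rows of the wrong width are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Vector;

public class FixedWidthRows {
    // Hypothetical helper: stores each row as a String[] of known width.
    // The outer Vector still grows dynamically with the number of rows.
    public static Vector<String[]> parse(Iterable<String> csvLines, int numColumns) {
        Vector<String[]> rows = new Vector<>();
        for (String line : csvLines) {
            // limit -1 keeps trailing empty fields so widths stay comparable
            String[] cols = line.split(",", -1);
            if (cols.length != numColumns) {
                throw new IllegalArgumentException(
                    "expected " + numColumns + " columns but got " + cols.length);
            }
            rows.add(cols);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a,b,c", "d,e,f");
        Vector<String[]> rows = parse(lines, 3);
        // prints "f": row 1, column 2
        System.out.println(rows.get(1)[2]);
    }
}
```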