I am writing a MapReduce job that finds the k objects from a huge

Question

Asked: June 8, 2026 (2026-06-08T08:47:00+00:00)

I am writing a MapReduce job that finds the k objects in a huge dataset that have the lowest distances from a point.

In my mapper, I want to emit only the k objects with the lowest distances for that block of data. This way, I have k intermediate (key, value) pairs for each block of data, where the key is the distance and the value is the object_id. Then in my reducer I can process and summarize the k lowest values easily.

I can't think of a way, in my mapper class, to emit intermediate key-value pairs for only the k objects with the lowest distance from a point within one block of data.

I know that I can emit (distance, obj_id) as an intermediate key-value pair for every input record in the data block, then reduce those in my reducer class and get the same result. But k << (number of records in each data block), so emitting only k intermediate key-value pairs instead of all of them would significantly reduce the amount of data transferred during the shuffle.

Any help is appreciated

thanks

1 Answer

Editorial Team · Answer 1 · 2026-06-08T08:47:01+00:00

Assuming that k is small (you can fit this number of objects in memory), then this should be easy enough:

1. Create a wrapper / container object that contains two instance variables: the calculated distance (float/double?) and the object_id (Text?).
2. There are a number of possible ways to maintain a fixed-size set of values, but for this example let's use a TreeSet (of your wrapper object type).
3. Either ensure your wrapper object implements the Comparable interface, or create a Comparator implementation that the TreeSet can use to determine order. The implementation should first compare the distance instance variable and, if the distances are equal, then compare the object IDs. (This leads to an interesting question: if you want to retain the smallest 10 values but there are 20 values that all share the smallest distance, which 10 do you want to keep?)
4. As you process values in your mapper, calculate the distance. If the TreeSet holds fewer than k entries, or the distance is smaller than that of the set's tail (largest) entry, add this distance/obj_id pair. When the set holds fewer than k entries, create a new instance of your wrapper; otherwise evict the tail entry and reuse it to host the new distance/obj_id (be sure to remove it from the set, amend the values, then re-add it).
5. In the cleanup method of your mapper, output the contents of the TreeSet, one entry at a time.
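The steps above can be sketched in plain Java, without the Hadoop classes, so the bounded TreeSet logic can be read on its own. This is only a minimal sketch under some assumptions: the names `DistanceEntry`, `TopKNearest`, `offer` and `nearestFirst` are made up for illustration, the distance is taken to be a `double` and the object ID a `String`, and ties on distance are broken by keeping the lexicographically smaller IDs. It also allocates a new wrapper per candidate rather than reusing the evicted one, which is simpler but creates more garbage.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical wrapper pairing a computed distance with an object ID. */
final class DistanceEntry implements Comparable<DistanceEntry> {
    final double distance;
    final String objectId;

    DistanceEntry(double distance, String objectId) {
        this.distance = distance;
        this.objectId = objectId;
    }

    /** Orders by distance first, then by object ID so ties are deterministic. */
    @Override
    public int compareTo(DistanceEntry other) {
        int byDistance = Double.compare(distance, other.distance);
        return byDistance != 0 ? byDistance : objectId.compareTo(other.objectId);
    }
}

/** Hypothetical helper that keeps the k entries with the smallest distances. */
final class TopKNearest {
    private final int k;
    private final TreeSet<DistanceEntry> entries = new TreeSet<>();

    TopKNearest(int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.k = k;
    }

    /** Intended to be called once per input record, e.g. from the mapper's map method. */
    void offer(double distance, String objectId) {
        DistanceEntry candidate = new DistanceEntry(distance, objectId);
        if (entries.size() < k) {
            entries.add(candidate);
        } else if (candidate.compareTo(entries.last()) < 0) {
            // Evict the current largest entry only if the candidate was really
            // inserted (an exact duplicate would be rejected by the set).
            if (entries.add(candidate)) {
                entries.pollLast();
            }
        }
    }

    /** Intended to be drained in the mapper's cleanup method, smallest distance first. */
    List<DistanceEntry> nearestFirst() {
        return new ArrayList<>(entries);
    }
}
```

In a real job, the mapper would hold one `TopKNearest` instance as a field, call `offer` for each record in `map`, and write each entry returned by `nearestFirst` to the context in `cleanup`, so that at most k pairs leave each mapper.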
