I am writing a MapReduce job that finds the k objects in a huge dataset that have the lowest distances from a given point.
In my mapper, I want to emit only the k objects with the lowest distances for that block of data. This way, I have k intermediate (key, value) pairs for each block of data, where the key is the distance and the value is the object_id. Then, in my reducer, I can easily process them and pick the k lowest values overall.
However, I can't think of a way to emit, from my mapper class, only the intermediate key-value pairs for the k objects with the lowest distance from the point within one block of data. Is there a way to do this?
I know that I could emit a (distance, obj_id) pair for every record in the data block, reduce those in my reducer class, and get the same result. But k << (number of records in each data block), so emitting only k intermediate key-value pairs instead of all of them would significantly reduce the amount of data transferred during the shuffle.
Any help is appreciated.
Thanks.
Assuming that k is small (i.e. you can fit k objects in memory), this should be easy enough:
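One way to do it, as a rough sketch rather than a definitive implementation: keep a bounded max-heap of size k inside the mapper, update it for every record, and emit nothing until the whole block has been read. The class below is plain Java with no Hadoop dependency so it can be run on its own; `TopKBuffer`, `Entry`, `offer` and `drainAscending` are names I made up for illustration, and I am assuming the object id is a string.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical helper: remembers the k entries with the smallest distance seen so far. */
public class TopKBuffer {

    /** One candidate: a distance from the query point and the object's id. */
    public static final class Entry {
        public final double distance;
        public final String objectId;

        public Entry(double distance, String objectId) {
            this.distance = distance;
            this.objectId = objectId;
        }
    }

    private final int k;

    // Max-heap on distance: the head is the worst of the k best entries kept so far.
    private final PriorityQueue<Entry> heap;

    public TopKBuffer(int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.k = k;
        this.heap = new PriorityQueue<>(
                k, Comparator.comparingDouble((Entry e) -> e.distance).reversed());
    }

    /** Call once per input record (from map()); this only updates the heap, it emits nothing. */
    public void offer(double distance, String objectId) {
        if (heap.size() < k) {
            heap.add(new Entry(distance, objectId));
        } else if (distance < heap.peek().distance) {
            // The new record beats the current worst, so it takes its place.
            heap.poll();
            heap.add(new Entry(distance, objectId));
        }
    }

    /** Call once when the block is finished (from cleanup()); returns the kept entries, nearest first. */
    public List<Entry> drainAscending() {
        List<Entry> out = new ArrayList<>(heap);
        out.sort(Comparator.comparingDouble((Entry e) -> e.distance));
        heap.clear();
        return out;
    }
}
```

In a Hadoop job using the `org.apache.hadoop.mapreduce` API, you would hold a `TopKBuffer` as a field of your `Mapper`, call `offer(...)` from `map()` without writing to the context, and then override `cleanup(Context)` to loop over `drainAscending()` and call `context.write(...)` once per entry, for example with a `DoubleWritable` key and a `Text` value. Each map task then ships at most k pairs to the shuffle, and the reducer can apply the same buffer to pick the overall k. Memory use is O(k) per map task and each record costs O(log k).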