The Archive Base

Editorial Team
Asked: June 8, 2026

I am writing a MapReduce job that finds the k objects in a huge dataset that have the lowest distances from a point.

In my mapper, I want to emit only the k objects with the lowest distances for that block of data. This way, I have k intermediate (key, value) pairs for each block of data, where the key is the distance and the value is the object_id. Then in my reducer() I can easily process and summarize the k lowest values.

I can’t think of a way, in my mapper class, to emit intermediate key-value pairs only for the k objects with the lowest distance from a point within one block of data.

I know that I can emit (distance, obj_id) as an intermediate key-value pair for every input record in the data block, then reduce those in my reducer class and get the same result. But k << (number of records in each data block), so by emitting only k intermediate key-value pairs instead of all of them, I can significantly reduce the amount of data transferred during the shuffle.

Any help is appreciated

thanks

1 Answer

  1. Editorial Team
    Added an answer on June 8, 2026 at 8:47 am

    Assuming that k is small (you can fit this number of objects in memory), then this should be easy enough:

    • Create a wrapper / container object that contains two instance variables – the calculated distance (float/double?) and the object_id (Text?)
    • There are several ways to maintain a fixed-size set of values, but for this example let's use a TreeSet (of your wrapper object type).
    • Either make your wrapper object implement the Comparable interface, or create a Comparator implementation that the TreeSet can use to determine order. The implementation should compare the distance first and, if the distances are equal, compare the object IDs. (This raises an interesting question: if you want to retain the smallest 10 values but 20 values share the smallest distance, which 10 do you want to keep?)
    • As you process values in your mapper, calculate the distance. If the TreeSet holds fewer than k entries, or the distance is smaller than the distance of the set's tail (largest) value, add this distance/obj_id pair. When the set holds fewer than k entries, create a new wrapper instance; otherwise evict the tail value and reuse it to hold the new distance/obj_id (be sure to remove it from the set, amend its values, then re-add it).
    • In the cleanup method of your mapper, output the tree set of values, one at a time.
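
    The steps above could be sketched roughly as follows. This is a minimal, plain-Java illustration with no Hadoop dependency, so it can be run on its own. The names KNearestBuffer, Candidate, offer and results are hypothetical and only for illustration. It also assumes the object ID is a String and the distance a double, and for simplicity it allocates a new wrapper per accepted record instead of reusing the evicted one. In a real Hadoop Mapper you would call something like offer() from map() and write the buffered pairs with context.write() from cleanup().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper (not a Hadoop class): keeps the k candidates with the smallest distances.
final class KNearestBuffer {

    // Wrapper for one (distance, objectId) pair, ordered by distance, then by ID to break ties.
    static final class Candidate implements Comparable<Candidate> {
        final double distance;
        final String objectId; // assumption: IDs are strings

        Candidate(double distance, String objectId) {
            this.distance = distance;
            this.objectId = objectId;
        }

        @Override
        public int compareTo(Candidate other) {
            int byDistance = Double.compare(distance, other.distance);
            return byDistance != 0 ? byDistance : objectId.compareTo(other.objectId);
        }
    }

    private final int k;
    private final TreeSet<Candidate> best = new TreeSet<>();

    KNearestBuffer(int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.k = k;
    }

    // Call once per input record (in a mapper: from map()).
    void offer(double distance, String objectId) {
        Candidate candidate = new Candidate(distance, objectId);
        // Accept if there is still room, or if it beats the current worst (tail) entry.
        if (best.size() < k || candidate.compareTo(best.last()) < 0) {
            best.add(candidate);
            if (best.size() > k) {
                best.pollLast(); // evict the largest distance
            }
        }
    }

    // Buffered candidates in ascending order (in a mapper: emit these from cleanup()).
    List<Candidate> results() {
        return new ArrayList<>(best);
    }

    public static void main(String[] args) {
        KNearestBuffer buffer = new KNearestBuffer(2);
        buffer.offer(4.0, "obj-a");
        buffer.offer(1.5, "obj-b");
        buffer.offer(3.0, "obj-c");
        buffer.offer(9.0, "obj-d");
        for (Candidate c : buffer.results()) {
            System.out.println(c.distance + "\t" + c.objectId);
        }
    }
}
```

    With this tie-breaking rule, when more than k records share the smallest distance, the ones with the lexicographically smallest IDs are kept. That is just one possible answer to the tie question raised above. The reducer can apply the same buffer to the at most k pairs it receives from each mapper.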