The Archive Base

Editorial Team
Asked: June 4, 2026

I am trying to implement k-means in Java on hadoop-1.0.1, and I am getting frustrated. I found a GitHub link to a complete implementation of k-means, but as a newbie in Hadoop I want to learn it without copying someone else's code. I have basic knowledge of the map and reduce functions available in Hadoop. Can somebody give me an idea of how to implement the k-means mapper and reducer classes? Does it require iteration?

1 Answer

  1. Editorial Team
    Added an answer on June 4, 2026 at 2:35 am

    OK, I'll give it a go and tell you what I had in mind when I implemented k-means in MapReduce.
    This implementation differs from Mahout's, mainly because it is meant to show how the algorithm could work in a distributed setup (and not for real production usage).
    I also assume that you already know how k-means works.

    That said, we have to divide the whole algorithm into three main stages:

    1. Job level
    2. Map level
    3. Reduce level

    The Job Level

    The job level is fairly simple: it writes the input (key = a class called ClusterCenter, value = a class called VectorWritable), handles the iteration of the Hadoop jobs, and reads the output of the whole job.

    VectorWritable is a serializable implementation of a vector, in this case from my own math library, but it is really nothing more than a simple double array.

    The ClusterCenter is mainly a VectorWritable, but with convenience functions that a center usually needs (averaging, for example).

    In k-means you have a seed set of k vectors that are your initial centers and some input vectors that you want to cluster. It is exactly the same in MapReduce, but I write them to two different files. The first file contains only the vectors, each with a dummy center as its key, and the other file contains the real initial centers (namely cen.seq).

    Once all of that is written to disk you can start your first job. This will of course first launch a Mapper, which is the next topic.
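To make the "simple double array" idea concrete, here is a minimal sketch of such a vector class. It is not the VectorWritable from the answer: the class name SimpleVector and the roundTrip helper are made up for illustration, and only java.io is used so the snippet compiles without Hadoop. The write/readFields pair has the same shape as the two methods of Hadoop's Writable interface, so in a real job the class would additionally declare `implements Writable` (and a class used as a map output key, like ClusterCenter, would need WritableComparable).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a vector value type: a double array that can
// write itself to a DataOutput and read itself back from a DataInput.
class SimpleVector {
    double[] values = new double[0];

    SimpleVector() {}

    SimpleVector(double... values) {
        this.values = values.clone();
    }

    // Length first, then each component, so readFields knows how much to read.
    void write(DataOutput out) throws IOException {
        out.writeInt(values.length);
        for (double v : values) {
            out.writeDouble(v);
        }
    }

    // Restores the state written by write(); existing contents are replaced.
    void readFields(DataInput in) throws IOException {
        values = new double[in.readInt()];
        for (int i = 0; i < values.length; i++) {
            values[i] = in.readDouble();
        }
    }

    // Illustration only: serializes a vector to bytes and reads it back.
    static SimpleVector roundTrip(SimpleVector v) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            v.write(new DataOutputStream(bytes));
            SimpleVector copy = new SimpleVector();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `SimpleVector.roundTrip(new SimpleVector(1.5, -2.0, 3.0))` gives back a vector with the same three components, which is all the framework needs in order to move values between tasks.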

    The Map Level

    In MapReduce it is always smart to know what is coming in and what is going out (in terms of objects).
    From the job level we know that we have ClusterCenter and VectorWritable as input, where the ClusterCenter is currently just a dummy. We want the same types as output, because the map stage is the well-known assignment step of normal k-means.

    You read the real centers file you created at the job level into memory, so that the input vectors can be compared with the centers. For this you need a distance metric; in the mapper it is hardcoded to the ManhattanDistance.
    To be a bit more specific: you get a part of your input in the map stage and iterate over each input key/value pair, comparing its vector with each of the centers. You track which center is the nearest, and then assign the vector to that center by writing the nearest ClusterCenter object along with the input vector itself to disk.

    Your output is then n vectors, each along with its assigned center (as the key).
    Hadoop now sorts and groups by your key, so you get every vector assigned to a single center together in the reduce stage.
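The assignment step described above boils down to a nearest-center search. The following is a plain-Java sketch of that search under the assumption that vectors and centers are bare double arrays of equal length; the class and method names are hypothetical and no Hadoop types are involved. In an actual Mapper, the centers would be loaded once before the first record (for example in setup) and the result would be emitted as the output key instead of being returned as an index.

```java
// Plain-Java sketch of the assignment step; names are illustrative only.
class AssignmentSketch {

    // Manhattan (L1) distance: sum of absolute component differences.
    static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    // Returns the index of the center closest to the given vector,
    // or -1 if there are no centers at all.
    static int nearestCenter(double[] vector, double[][] centers) {
        int best = -1;
        double bestDistance = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double d = manhattan(vector, centers[c]);
            if (d < bestDistance) {
                bestDistance = d;
                best = c;
            }
        }
        return best;
    }
}
```

With centers (0, 0) and (10, 10), the vector (1, 2) is assigned to the first center and (8, 9) to the second.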

    The Reduce Level

    As described above, in the reduce stage you have a ClusterCenter and its assigned VectorWritables.
    This is the usual update step of normal k-means: you simply iterate over all the vectors, sum them up and average them.

    Now you have a new mean, which you can compare to the mean the vectors were assigned to before. Here you can measure the difference between the two centers, which tells you how much the center moved. Ideally it has not moved at all, which means it has converged.

    A Hadoop counter is used to track this convergence. Its name is a bit misleading, because it actually tracks how many centers have not yet converged to a final point, but I hope you can live with it.

    Basically, you now write the new center and all of its vectors to disk again for the next iteration. In addition, in the cleanup step you write all the newly computed centers to the path that the map step reads from, so the next iteration works with the new centers.
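The update and convergence check can be sketched in plain Java as follows. The names and the tolerance parameter are assumptions made for this example and do not come from the original implementation; vectors are again bare double arrays. In a real Reducer, the place where hasConverged returns false is where a job counter would be incremented, for instance through context.getCounter(...).increment(1) with an enum of your own.

```java
import java.util.List;

// Plain-Java sketch of the update step; names are illustrative only.
class UpdateSketch {

    // Component-wise mean of all vectors assigned to one center.
    // Assumes the list is non-empty, which holds in a reducer because a
    // key is only passed to reduce() if at least one value exists for it.
    static double[] mean(List<double[]> assigned) {
        double[] result = new double[assigned.get(0).length];
        for (double[] v : assigned) {
            for (int i = 0; i < result.length; i++) {
                result[i] += v[i];
            }
        }
        for (int i = 0; i < result.length; i++) {
            result[i] /= assigned.size();
        }
        return result;
    }

    // How far the center moved, measured with the Manhattan distance.
    static double shift(double[] oldCenter, double[] newCenter) {
        double sum = 0.0;
        for (int i = 0; i < oldCenter.length; i++) {
            sum += Math.abs(oldCenter[i] - newCenter[i]);
        }
        return sum;
    }

    // A center counts as converged if it moved by at most the tolerance.
    static boolean hasConverged(double[] oldCenter, double[] newCenter,
                                double tolerance) {
        return shift(oldCenter, newCenter) <= tolerance;
    }
}
```

Comparing against a small tolerance rather than testing for exact equality avoids extra iterations caused purely by floating-point noise; whether that is appropriate depends on your data.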


    Back at the job level, the MapReduce job should now be done. We inspect that job's counter to find out how many centers have not converged yet.
    This counter is used in the while loop to decide whether the whole algorithm can stop.
    If not, go back to the Map Level section, but use the output of the previous job as the input.
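To show how the three levels fit together, here is an in-memory simulation of the whole loop. It is only a sketch: each call to runIteration stands in for one complete MapReduce job, its return value plays the role of the "not converged" counter, and all names as well as the maxIterations safety cap are assumptions added for this example. With real Hadoop jobs the driver would instead build a new Job per pass, wait for it to finish, and read the counter value from the finished job.

```java
// In-memory simulation of the iterative driver; no Hadoop involved.
class IterationSketch {

    static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    // One simulated job: assign each vector to its nearest center ("map"),
    // average each group ("reduce"), overwrite the centers in place, and
    // return how many centers moved (the role of the Hadoop counter).
    static int runIteration(double[][] vectors, double[][] centers) {
        int dim = centers[0].length;
        double[][] sums = new double[centers.length][dim];
        int[] counts = new int[centers.length];
        for (double[] v : vectors) {
            int best = 0;
            for (int c = 1; c < centers.length; c++) {
                if (manhattan(v, centers[c]) < manhattan(v, centers[best])) {
                    best = c;
                }
            }
            counts[best]++;
            for (int i = 0; i < dim; i++) {
                sums[best][i] += v[i];
            }
        }
        int notConverged = 0;
        for (int c = 0; c < centers.length; c++) {
            if (counts[c] == 0) {
                continue; // nothing assigned: keep the old center
            }
            double[] updated = new double[dim];
            for (int i = 0; i < dim; i++) {
                updated[i] = sums[c][i] / counts[c];
            }
            if (manhattan(updated, centers[c]) > 1e-9) {
                notConverged++;
            }
            centers[c] = updated;
        }
        return notConverged;
    }

    // Driver loop: rerun until no center moved or the cap is reached.
    // Returns the number of iterations that were executed.
    static int cluster(double[][] vectors, double[][] centers,
                       int maxIterations) {
        int iteration = 0;
        int notConverged = 1;
        while (notConverged > 0 && iteration < maxIterations) {
            notConverged = runIteration(vectors, centers);
            iteration++;
        }
        return iteration;
    }
}
```

For the vectors (1, 1), (2, 2), (9, 9), (10, 10) and the starting centers (0, 0) and (5, 5), the loop stops after two iterations with centers (1.5, 1.5) and (9.5, 9.5): the first pass moves both centers, the second pass confirms that nothing moves any more.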

    And that is really the whole voodoo.

    For obvious reasons this shouldn't be used in production, because its performance is horrible. It is better to use Mahout's more tuned version. But for educational purposes this algorithm is fine 😉

    If you have any more questions, feel free to write me a mail or comment.

© 2021 The Archive Base. All Rights Reserved