I would like some suggestions on input for k-means clustering.
I am relatively new to k-means clustering (and to clustering in general) and found this source code:
k-means by Shyam Sivaraman
I will probably use this Java code, since my supervisor wants me to adapt and apply an existing algorithm rather than write it from scratch.
So, according to the code:
Vector dataPoints = new Vector();
dataPoints.add(new DataPoint(22,21,"data1"));
dataPoints.add(new DataPoint(19,20,"data2"));
dataPoints.add(new DataPoint(18,22,"data3"));
...
What I know so far is that it accepts a data point with two variables (x and y) plus a name for the data, based on the following code:
public DataPoint(double x, double y, String name) {
    this.mX = x;
    this.mY = y;
    this.mObjName = name;
}
Now I want to change the input to accept document vectors, since I'm doing document clustering. Any suggestions on how to change the code? In words, if possible (code as a last option). If you have found any links on this topic, please share them here as well.
Looking forward to any suggestions.
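In the simplest approach, you have to calculate a document-term matrix: one row per document, one column per distinct term, and each cell holding how often that term occurs in that document.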
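To make that concrete, here is a minimal sketch of building such a matrix from raw strings. The class and method names (`DocumentTermMatrix`, `tokenize`, `buildVocabulary`, `toMatrix`) are made up for illustration and are not part of the k-means source you linked. The tokenizer is a deliberately naive assumption (lower-case, split on anything that is not an ASCII letter or digit); real text would probably also need stop-word removal or stemming.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of the original k-means source.
final class DocumentTermMatrix {

    // Naive tokenizer (assumption): lower-case, split on non-alphanumeric ASCII.
    static List<String> tokenize(String doc) {
        List<String> terms = new ArrayList<>();
        for (String t : doc.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Assigns each distinct term a column index, in order of first appearance.
    static Map<String, Integer> buildVocabulary(List<String> docs) {
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (String doc : docs) {
            for (String term : tokenize(doc)) {
                vocab.putIfAbsent(term, vocab.size());
            }
        }
        return vocab;
    }

    // Row i holds the raw term counts of document i.
    static double[][] toMatrix(List<String> docs, Map<String, Integer> vocab) {
        double[][] matrix = new double[docs.size()][vocab.size()];
        for (int i = 0; i < docs.size(); i++) {
            for (String term : tokenize(docs.get(i))) {
                Integer col = vocab.get(term);
                if (col != null) {
                    matrix[i][col] += 1.0;
                }
            }
        }
        return matrix;
    }
}
```

Each row of the returned matrix is then the vector for one document, and its length equals the vocabulary size.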
Your code clusters vectors (x, y) in 2D space. You just have to extend it to N-dimensional space, where N is the dimension of the vectors in the document-term matrix (that is, the number of distinct terms).
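As a rough sketch of what that extension could look like: replace the two fields with an array, and make the distance and centroid calculations loop over all dimensions. The class name `VectorDataPoint` and its methods are hypothetical; I have not mapped them onto the actual classes in the linked code, so treat this as the idea rather than a drop-in patch. It assumes Euclidean distance, which is the usual choice for k-means.

```java
import java.util.List;

// Hypothetical N-dimensional replacement for a 2D data point class.
final class VectorDataPoint {
    private final double[] values;
    private final String name;

    VectorDataPoint(double[] values, String name) {
        this.values = values.clone(); // defensive copy
        this.name = name;
    }

    String getName() {
        return name;
    }

    int dimension() {
        return values.length;
    }

    // Euclidean distance from this point to a centroid of the same dimension.
    double distanceTo(double[] centroid) {
        if (centroid.length != values.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < values.length; i++) {
            double diff = values[i] - centroid[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Centroid of a non-empty cluster: the per-dimension mean of its points.
    static double[] centroidOf(List<VectorDataPoint> points) {
        if (points.isEmpty()) {
            throw new IllegalArgumentException("empty cluster");
        }
        double[] mean = new double[points.get(0).dimension()];
        for (VectorDataPoint p : points) {
            for (int i = 0; i < mean.length; i++) {
                mean[i] += p.values[i];
            }
        }
        for (int i = 0; i < mean.length; i++) {
            mean[i] /= points.size();
        }
        return mean;
    }
}
```

Wherever the existing code computes something from x and y separately (distance to a centroid, recomputing a centroid), the same loop-over-dimensions pattern applies.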
I also suggest looking at TF*IDF weighting; it could improve the clustering results.
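For reference, here is a small sketch of one common TF*IDF variant applied to a matrix of raw term counts: the weight is `count * ln(N / df)`, where `N` is the number of documents and `df` is the number of documents containing the term. Other variants exist (smoothed IDF, normalized TF), so this is one reasonable choice rather than the definitive formula. The class name `TfIdf` and method `weight` are made up for this example.

```java
// Hypothetical helper, not part of the original k-means source.
final class TfIdf {

    // counts[i][j] = raw count of term j in document i (rows of equal length).
    // Returns a new matrix with weight = count * ln(nDocs / documentFrequency).
    static double[][] weight(double[][] counts) {
        int nDocs = counts.length;
        if (nDocs == 0) {
            return new double[0][0];
        }
        int nTerms = counts[0].length;

        // Inverse document frequency per term (column).
        double[] idf = new double[nTerms];
        for (int j = 0; j < nTerms; j++) {
            int df = 0;
            for (double[] row : counts) {
                if (row[j] > 0) {
                    df++;
                }
            }
            // A term that appears in no document gets weight 0.
            idf[j] = (df == 0) ? 0.0 : Math.log((double) nDocs / df);
        }

        double[][] weighted = new double[nDocs][nTerms];
        for (int i = 0; i < nDocs; i++) {
            for (int j = 0; j < nTerms; j++) {
                weighted[i][j] = counts[i][j] * idf[j];
            }
        }
        return weighted;
    }
}
```

With this variant, a term that occurs in every document gets weight 0, so very common words stop influencing the distances between documents.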