OK, I will run through what I'm trying to achieve and how I tried to achieve it, then explain why I tried this method.
I have data from the KDD Cup 1999 in its original format; the data has 494k rows and 42 columns.
My goal is to cluster this data without supervision. On a previous question here:
I received this feedback:
For starters, you need to normalize the attributes to be of the same
scale: when computing the euclidean distance as part of step 3 in your
method, the features with values such as 239 and 486 will dominate
over the other features with small values as 0.05, thus disrupting the
result. Another point to remember is that too many attributes can be a bad
thing (curse of dimensionality). Thus you should look into feature
selection or dimensionality reduction techniques.
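To make the scale point in that feedback concrete, here is a minimal sketch using two made-up observations. The values 239, 486 and 0.05 come from the feedback; 0.90 and the variable names are assumptions chosen only for illustration.

```matlab
% Two hypothetical observations: feature 1 is large-scale, feature 2 is small-scale.
a = [239 0.05];
b = [486 0.90];

% Contribution of each feature to the squared Euclidean distance.
contrib = (a - b).^2;

% Fraction of the total squared distance owed to each feature.
share = contrib / sum(contrib);
disp(share)
```

With these numbers almost all of the distance comes from the first feature, so the second feature has practically no influence on which centroid a point is assigned to.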
So the first thing I did was address feature selection, following this article: http://narensportal.com/papers/datamining-classification-algorithm.aspx#_sec-2-1
After selecting the necessary features, the data looks like this:

For the clustering I removed the discrete values, which left me with 3 columns of numeric data. I then removed the duplicate rows (see: junk, index and unique on a matrix (how to keep matrix format)), which reduced the data from 494k rows to 67k. This was done like so:
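%# keep the first occurrence of each unique row (unique returns them sorted)
[M,ind] = unique(data, 'rows', 'first');
%# restore the original row order
[~,ind] = sort(ind);
M = M(ind,:);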
I then used a random permutation to reduce the data from 67k rows to 1000, like so:
m = 1000;
n = 3;
%# pick random rows
indX = randperm( size(M,1) );
indX = indX(1:m);
%# pick random columns
indY = randperm( size(M,2) );
indY = indY(1:n);
%# filter data
data = M(indX,indY);
So now I have a file with the 3 features I selected, with duplicate records removed and a random permutation used to reduce the dataset further. My last step was to normalize this data, which I did with:
normalized_data = data/norm(data);
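For comparison, note that `norm(data)` on a matrix returns a single number (the matrix 2-norm), so the line above rescales every column by the same factor and leaves their relative scales unchanged. Below is a minimal sketch of per-column scaling, which is closer to what the quoted feedback describes. The sample matrix is made up, and the names `mu`, `sigma` and `scaled` are assumptions for illustration; `zscore(data)` from the Statistics Toolbox should give the same result as the manual version.

```matlab
% Made-up stand-in for the sampled 1000-by-3 matrix, with very different column scales.
data = [1000*rand(1000,1), rand(1000,1), 0.01*rand(1000,1)];

% Dividing by one scalar keeps the columns on their original relative scales.
scaled = data / norm(data);
ratioBefore = std(data(:,1))   / std(data(:,3));
ratioAfter  = std(scaled(:,1)) / std(scaled(:,3));

% Per-column z-score: subtract each column mean, divide by each column std.
mu    = mean(data, 1);
sigma = std(data, 0, 1);
sigma(sigma == 0) = 1;   % guard against constant columns
normalized_data = bsxfun(@rdivide, bsxfun(@minus, data, mu), sigma);
```

After this step every column has mean 0 and standard deviation 1, so no single feature dominates the distance. The normalized matrix would also have to be the one passed to `kmeans` for the scaling to have any effect.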
I then used the following K-means script:
%% generate clusters
K = 4;
%% cluster
opts = statset('MaxIter', 500, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);
%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 50, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 200, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);
%% Assign data to clusters
% calculate distance (squared) of all instances to each cluster centroid
numObservations = size(data, 1);   % number of rows (instances) in data
D = zeros(numObservations, K);     % init distances
for k=1:K
    % squared Euclidean distance: sum((x-y).^2)
    D(:,k) = sum( ((data - repmat(clusters(k,:),numObservations,1)).^2), 2);
end
% find for all instances the cluster closest to it
[minDists, clusterIndices] = min(D, [], 2);
% compare it with what you expect it to be
sum(clusterIndices == clustIDX)
But my results are still coming out like those in my original question here: clustering and matlab
Here is what the data looks like when plotted:

and:

Can anyone help solve this problem? Are the methods I'm using not the correct ones, or is there something I'm missing?
I'd just like to say thanks to cyborg and Amro for helping. I realized that, rather than creating my own pre-processing, I could keep the dimensions as they were, and I finally managed to get some clustered data!
Output:
Of course I still have some outliers, but if I could get rid of them and plot the graph from -0.2 to 0.2 I'm sure it would look a lot better. But if you look at the original attempt, I seem to be getting there!
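As a rough sketch of what dropping the outliers and zooming in could look like: the sample points, the name `limit`, and the rule "keep rows whose coordinates all lie within the limit" are assumptions for illustration, not a tested outlier-detection method.

```matlab
% Made-up stand-in for the normalized points, plus two obvious outliers.
data = [0.05*randn(200,3); 5 5 5; -4 6 -3];

% Keep only rows whose three coordinates all lie within [-limit, limit].
limit   = 0.2;
inRange = all(abs(data) <= limit, 2);
trimmed = data(inRange, :);

% Plot the remaining points with fixed axis limits on all three axes.
figure
scatter3(trimmed(:,1), trimmed(:,2), trimmed(:,3), 50, 'filled')
axis([-limit limit -limit limit -limit limit])
xlabel('x'), ylabel('y'), zlabel('z')
```

Filtering before clustering changes which points `kmeans` sees, whereas only setting the axis limits changes the view and nothing else, so the two should not be confused.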