Hi I’m working on trying to cluster network data from the 1999 darpa data set. Unfortunately I’m not really getting clustered data, not compared to some of the literature, using the same techniques and methods.
My data comes out like this:

As you can see, it is not very Clustered. This is due to a lot of outliers (noise) in the dataset. I have looked at some outlier removal techniques but nothing I have tried so far really cleans the data. One of the methods I have tried:
%% When an outlier is considered to be more than three standard deviations away from the mean, determine the number of outliers in each column of the count matrix:
mu = mean(data)
sigma = std(data)
[n,p] = size(data);
% Create a matrix of mean values by replicating the mu vector for n rows
MeanMat = repmat(mu,n,1);
% Create a matrix of standard deviation values by replicating the sigma vector for n rows
SigmaMat = repmat(sigma,n,1);
% Create a matrix of zeros and ones, where ones indicate the location of outliers
outliers = abs(data - MeanMat) > 3*SigmaMat;
% Calculate the number of outliers in each column
nout = sum(outliers)
% To remove an entire row of data containing the outlier
data(any(outliers,2),:) = [];
In the first run, it removed 48 rows from the 1000 normalized random rows which are selected from the full dataset.
This is the full script I used on the data:
%% load data
%# read the list of features
fid = fopen('kddcup.names','rt');
C = textscan(fid, '%s %s', 'Delimiter',':', 'HeaderLines',1);
fclose(fid);
%# determine type of features
C{2} = regexprep(C{2}, '.$',''); %# remove "." at the end
attribNom = [ismember(C{2},'symbolic');true]; %# nominal features
%# build format string used to read/parse the actual data
frmt = cell(1,numel(C{1}));
frmt( ismember(C{2},'continuous') ) = {'%f'}; %# numeric features: read as number
frmt( ismember(C{2},'symbolic') ) = {'%s'}; %# nominal features: read as string
frmt = [frmt{:}];
frmt = [frmt '%s']; %# add the class attribute
%# read dataset
fid = fopen('kddcup.data_10_percent_corrected','rt');
C = textscan(fid, frmt, 'Delimiter',',');
fclose(fid);
%# convert nominal attributes to numeric
ind = find(attribNom);
G = cell(numel(ind),1);
for i=1:numel(ind)
[C{ind(i)},G{i}] = grp2idx( C{ind(i)} );
end
%# all numeric dataset
fulldata = cell2mat(C);
%% dimensionality reduction
columns = 6
[U,S,V]=svds(fulldata,columns);
%% randomly select dataset
rows = 1000;
columns = 6;
%# pick random rows
indX = randperm( size(fulldata,1) );
indX = indX(1:rows)';
%# pick random columns
indY = indY(1:columns);
%# filter data
data = U(indX,indY);
% apply normalization method to every cell
maxData = max(max(data));
minData = min(min(data));
data = ((data-minData)./(maxData));
% output matching data
dataSample = fulldata(indX, :)
%% When an outlier is considered to be more than three standard deviations away from the mean, use the following syntax to determine the number of outliers in each column of the count matrix:
mu = mean(data)
sigma = std(data)
[n,p] = size(data);
% Create a matrix of mean values by replicating the mu vector for n rows
MeanMat = repmat(mu,n,1);
% Create a matrix of standard deviation values by replicating the sigma vector for n rows
SigmaMat = repmat(sigma,n,1);
% Create a matrix of zeros and ones, where ones indicate the location of outliers
outliers = abs(data - MeanMat) > 2.5*SigmaMat;
% Calculate the number of outliers in each column
nout = sum(outliers)
% To remove an entire row of data containing the outlier
data(any(outliers,2),:) = [];
%% generate sample data
K = 6;
numObservarations = size(data, 1);
dimensions = 3;
%% cluster
opts = statset('MaxIter', 100, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);
%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 5, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 100, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
grid on
view([90 0]);
%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);
This is two distinct clusters from the output:

As you can see the data looks cleaner and more clustered than the original. However I still think a better method can be used.
For instance observing the overall clustering, I still have a lot of noise (outliers) from the dataset. As can be seen here:

I need the outlier rows put into a seperate dataset for later classification (only removed from the clustering)
Here is a link to the darpa dataset, please note that the 10% data set has had significant reduction in columns, a majority of columns which have 0 or 1’s running through-out have been removed (42 columns to 6 columns):
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
EDIT
Columns kept in the dataset are:
src_bytes: continuous.
dst_bytes: continuous.
count: continuous.
srv_count: continuous.
dst_host_count: continuous.
dst_host_srv_count: continuous.
RE-EDIT:
Based on discussions with Anony-Mousse and his answer, there may be a way of reducing noise in the clustering using K-Medoids http://en.wikipedia.org/wiki/K-medoids. I’m hoping that there isnt much of a change in the code that I currently have but as of yet I do not know how to implement it to test whether this will significantly reduce the noise. So providing that someone can show me a working example this will be accepted as an answer.
First things first: you’re asking for a lot here. For future reference: try to break up your problem in smaller chunks, and post several questions. This increases your chances of getting answers (and doesn’t cost you 400 reputation!).
Luckily for you, I understand your predicament, and just love this sort of question!
Apart from this dataset’s possible issues with k-means, this question is still generic enough to apply also to other datasets (and thus Googlers ending up here looking for a similar thing), so let’s go ahead and get this solved.
My suggestion is we edit this answer until you get reasonably satisfactory results.
Number of clusters
Step 1 of any clustering problem: how many clusters to choose? There are a few methods I know of with which you can select the proper number of clusters. There is a nice wiki page about this, containing all of the methods below (and a few more).
Visual inspection
It might seem silly, but if you have well-separated data, a simple plot can tell you already (approximately) how many clusters you’ll need, just by looking.
Pros:
Cons:
silhouettes plot
As indicated in one of your other questions, making a
silhouettesplot will help you make a better decision about the proper number of clusters in your data.Pros:
Cons:
elbow method
As with the silhouettes plot approach, you run
kmeansrepeatedly, each time with a larger amount of clusters, and you see how much of the total variance in the data is explained by the clusters chosen by thiskmeansrun. There will be a number of clusters where the amount of explaned variance will suddenly increase a lot less than in any of the previous choices of the number of clusters (the “elbow”). The elbow is then statistically speaking the best choice for the number of clusters.Pros:
Cons:
Outliers
Once you have chosen the number of clusters with any of the methods above, it is time to do outlier detection to see if the quality of your clusters improves.
I would start by a two-step-iterative approach, using the elbow method. In pseudo-Matlab:
The tough part is obviously determining whether
you are satisfied.This is the key to the algorithm’s effectiveness. The rough structure of
this part
would be something like this
the
situation has improvedwhen there are fewer outliers, or the varianceexplaned for ALL choices of
Nis better than during the previous loop in thewhile. This is also something to fiddle with.Anyway, much more than a first attempt I wouldn’t call this.
If anyone sees incompletenesses, flaws or loopholes here, please
comment or edit.