Editorial Team
Asked: June 7, 2026

Hi I’m working on trying to cluster network data from the 1999 darpa data

Hi, I’m trying to cluster network data from the 1999 DARPA data set. Unfortunately, I’m not getting well-clustered data compared with some of the literature, even though I’m using the same techniques and methods.

My data comes out like this:

Matlab Figure 1

As you can see, it is not very clustered. This is due to a lot of outliers (noise) in the dataset. I have looked at some outlier removal techniques, but nothing I have tried so far really cleans the data. One of the methods I have tried:

    %% An outlier is a value more than three standard deviations away from its
    %% column mean; count the outliers in each column of the data matrix:
    mu = mean(data)
    sigma = std(data)
    [n,p] = size(data);
    % Create a matrix of mean values by replicating the mu vector for n rows
    MeanMat = repmat(mu,n,1);
    % Create a matrix of standard deviation values by replicating the sigma vector for n rows
    SigmaMat = repmat(sigma,n,1);
    % Create a matrix of zeros and ones, where ones indicate the location of outliers
    outliers = abs(data - MeanMat) > 3*SigmaMat;
    % Calculate the number of outliers in each column
    nout = sum(outliers) 
    % To remove an entire row of data containing the outlier
    data(any(outliers,2),:) = [];

In the first run, it removed 48 rows from the 1000 normalized random rows which are selected from the full dataset.

This is the full script I used on the data:

    %% load data
        %# read the list of features
        fid = fopen('kddcup.names','rt');
        C = textscan(fid, '%s %s', 'Delimiter',':', 'HeaderLines',1);
        fclose(fid);

        %# determine type of features
        C{2} = regexprep(C{2}, '.$','');              %# remove "." at the end
        attribNom = [ismember(C{2},'symbolic');true]; %# nominal features

        %# build format string used to read/parse the actual data
        frmt = cell(1,numel(C{1}));
        frmt( ismember(C{2},'continuous') ) = {'%f'}; %# numeric features: read as number
        frmt( ismember(C{2},'symbolic') ) = {'%s'};   %# nominal features: read as string
        frmt = [frmt{:}];
        frmt = [frmt '%s'];                           %# add the class attribute

        %# read dataset
        fid = fopen('kddcup.data_10_percent_corrected','rt');
        C = textscan(fid, frmt, 'Delimiter',',');
        fclose(fid);

        %# convert nominal attributes to numeric
        ind = find(attribNom);
        G = cell(numel(ind),1);
        for i=1:numel(ind)
            [C{ind(i)},G{i}] = grp2idx( C{ind(i)} );
        end

        %# all numeric dataset
        fulldata = cell2mat(C);

    %% dimensionality reduction
    columns = 6;
    [U,S,V] = svds(fulldata,columns);

    %% randomly select dataset
    rows = 1000;
    columns = 6;

    %# pick random rows
    indX = randperm( size(fulldata,1) );
    indX = indX(1:rows)';

    %# pick columns (indY was never defined in the original script; U has
    %# only `columns` columns after svds, so all of them are used here)
    indY = 1:columns;

    %# filter data
    data = U(indX,indY);

    % apply normalization method to every cell
    maxData = max(max(data));
    minData = min(min(data));
    data = ((data-minData)./(maxData));

    % output matching data
    dataSample = fulldata(indX, :);

    %% When an outlier is considered to be more than 2.5 standard deviations away from the mean, use the following syntax to determine the number of outliers in each column of the count matrix:

    mu = mean(data)
    sigma = std(data)
    [n,p] = size(data);
    % Create a matrix of mean values by replicating the mu vector for n rows
    MeanMat = repmat(mu,n,1);
    % Create a matrix of standard deviation values by replicating the sigma vector for n rows
    SigmaMat = repmat(sigma,n,1);
    % Create a matrix of zeros and ones, where ones indicate the location of outliers
    outliers = abs(data - MeanMat) > 2.5*SigmaMat;
    % Calculate the number of outliers in each column
    nout = sum(outliers)
    % To remove an entire row of data containing the outlier
    data(any(outliers,2),:) = [];

    %% generate sample data
    K = 6;
    numObservarations = size(data, 1);
    dimensions = 3;

    %% cluster
    opts = statset('MaxIter', 100, 'Display', 'iter');
    [clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
        'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);

    %% plot data+clusters
    figure, hold on
    scatter3(data(:,1),data(:,2),data(:,3), 5, clustIDX, 'filled')
    scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 100, (1:K)', 'filled')
    hold off, xlabel('x'), ylabel('y'), zlabel('z')
    grid on
    view([90 0]);

    %% plot clusters quality
    figure
    [silh,h] = silhouette(data, clustIDX);
    avrgScore = mean(silh);

These are two distinct clusters from the output:

enter image description here

As you can see, the data looks cleaner and more clustered than the original. However, I still think a better method can be used.

For instance, looking at the overall clustering, I still have a lot of noise (outliers) in the dataset, as can be seen here:

enter image description here

I need the outlier rows put into a separate dataset for later classification (they should only be removed from the clustering).

Here is a link to the DARPA dataset. Please note that the 10% data set has had a significant reduction in columns: the majority of columns that contain only 0s or 1s throughout have been removed (42 columns down to 6 columns):

http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

EDIT

Columns kept in the dataset are:

src_bytes: continuous.

dst_bytes: continuous.

count: continuous.

srv_count: continuous.  

dst_host_count: continuous.

dst_host_srv_count: continuous.         

RE-EDIT:

Based on discussions with Anony-Mousse and his answer, there may be a way of reducing noise in the clustering using K-Medoids (http://en.wikipedia.org/wiki/K-medoids). I’m hoping it won’t require much of a change to the code I currently have, but as of yet I do not know how to implement it to test whether it will significantly reduce the noise. So if someone can show me a working example, it will be accepted as an answer.

1 Answer

    Editorial Team
    Added an answer on June 7, 2026 at 9:47 am

    First things first: you’re asking for a lot here. For future reference: try to break up your problem into smaller chunks, and post several questions. This increases your chances of getting answers (and doesn’t cost you 400 reputation!).

    Luckily for you, I understand your predicament, and just love this sort of question!

    Apart from this dataset’s possible issues with k-means, this question is generic enough to apply to other datasets as well (and thus to Googlers ending up here looking for something similar), so let’s go ahead and get this solved.

    My suggestion is we edit this answer until you get reasonably satisfactory results.

    Number of clusters

    Step 1 of any clustering problem: how many clusters to choose? There are a few methods I know of with which you can select the proper number of clusters. There is a nice wiki page about this, containing all of the methods below (and a few more).

    Visual inspection

    It might seem silly, but if you have well-separated data, a simple plot can tell you already (approximately) how many clusters you’ll need, just by looking.

    Pros:

    • quick
    • simple
    • works well on well-separated clusters in relatively small datasets

    Cons:

    • and dirty
    • requires user interaction
    • it’s easy to miss smaller clusters
    • data with poorly separated clusters, or with very many clusters, are hard to handle with this method
    • it is all rather subjective: the next person might select a different number than you did.

    Silhouette plot

    As indicated in one of your other questions, making a silhouette plot will help you make a better decision about the proper number of clusters in your data.

    Pros:

    • relatively simple
    • reduces subjectivity by using statistical measures
    • intuitive way to represent quality of the choice

    Cons:

    • requires user interaction
    • in the limit, if you take as many clusters as there are data points, a silhouette plot will tell you that this is the best choice
    • it is still rather subjective, not based on statistical means
    • can be computationally expensive
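
    As a rough sketch of the silhouette approach: the loop below runs kmeans for several candidate cluster counts and compares the mean silhouette value. The toy data `X`, the candidate range `Ks`, and the use of the mean as the summary statistic are assumptions made for illustration, and it relies on the Statistics and Machine Learning Toolbox.

```matlab
% Sketch only: toy data standing in for the real dataset (assumption).
rng(0);                                            % reproducible example
X = [randn(100,2); ...
     bsxfun(@plus, randn(100,2), [6  6]); ...
     bsxfun(@plus, randn(100,2), [6 -6])];         % three well-separated blobs

Ks = 2:6;                                          % candidate numbers of clusters
meanSilh = zeros(size(Ks));
for j = 1:numel(Ks)
    idx = kmeans(X, Ks(j), 'Replicates', 5);       % cluster with Ks(j) clusters
    s = silhouette(X, idx);                        % with an output argument, no plot is drawn
    meanSilh(j) = mean(s);                         % summarize quality for this K
end

[~, best] = max(meanSilh);                         % highest mean silhouette wins
bestK = Ks(best);
fprintf('best K by mean silhouette: %d\n', bestK);
```

    Calling `silhouette(X, idx)` without an output argument draws the plot itself, which is the version to inspect by eye.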

    Elbow method

    As with the silhouette plot approach, you run kmeans repeatedly, each time with a larger number of clusters, and you see how much of the total variance in the data is explained by the clusters chosen by that kmeans run. There will be a number of clusters at which the amount of explained variance suddenly increases a lot less than it did for any of the previous choices (the “elbow”). Statistically speaking, the elbow is then the best choice for the number of clusters.

    Pros:

    • no user interaction required — the elbow can be selected automatically
    • statistically more sound than any of the aforementioned methods

    Cons:

    • somewhat complicated
    • still subjective, since the definition of the “elbow” depends on subjectively chosen parameters
    • can be computationally expensive
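
    A minimal sketch of the elbow method, assuming "variance explained" means one minus the ratio of within-cluster to total sum of squares. The toy data and the `threshold` value are hypothetical; the threshold is exactly the kind of subjectively chosen parameter listed under the cons.

```matlab
% Sketch only: toy data standing in for the real dataset (assumption).
rng(0);
X = [randn(100,2); ...
     bsxfun(@plus, randn(100,2), [6  6]); ...
     bsxfun(@plus, randn(100,2), [6 -6])];         % three well-separated blobs

Ks = 1:8;                                          % candidate numbers of clusters
totalSS = sum(sum(bsxfun(@minus, X, mean(X,1)).^2));   % total sum of squares
explained = zeros(size(Ks));
for j = 1:numel(Ks)
    [~, ~, sumd] = kmeans(X, Ks(j), 'Replicates', 5);  % sumd: within-cluster sums
    explained(j) = 1 - sum(sumd) / totalSS;            % fraction of variance explained
end

gain = diff(explained);                            % gain(j): going from Ks(j) to Ks(j+1)
threshold = 0.05;                                  % hypothetical cut-off for "a lot less"
pos = find(gain < threshold, 1, 'first');          % first K after which gains are small
if isempty(pos)
    elbowK = Ks(end);                              % no elbow found in the tested range
else
    elbowK = Ks(pos);
end
fprintf('elbow at K = %d\n', elbowK);
```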

    Outliers

    Once you have chosen the number of clusters with any of the methods above, it is time to do outlier detection to see if the quality of your clusters improves.

    I would start with a two-step iterative approach, using the elbow method. In pseudo-Matlab:

    data = your initial dataset
    dataMod = your initial dataset
    
    MAX = the number of clusters chosen by visual inspection
    
    while (forever)
    
        for N = MAX-5 : MAX+5
            if (N < 1), continue, end
            perform k-means with N clusters on dataMod
            if (variance explained shows a jump)
                break
            end
        end
    
        if (you are satisfied)
            break
        end
    
        for i = 1:N
            extract all points from cluster i 
            find the centroid (let k-means do that)
            calculate the standard deviation of distances to the centroid
            mark points further than 3 sigma as possible outliers
        end
    
        dataMod = data with marked points removed
    
    end
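
    One pass of the outlier-marking step might look like the sketch below. It reads "further than 3 sigma" as "further from the centroid than the mean distance plus three standard deviations", which is an assumption, as are the toy data and the planted outliers. Flagged rows are moved to a separate matrix instead of being discarded, so they stay available for later classification.

```matlab
% Sketch only: two toy blobs plus two planted outliers (assumptions).
rng(1);
X = [randn(200,2); bsxfun(@plus, randn(200,2), [10 10])];
X = [X; 5 -8; -7 6];                       % rows 401 and 402 are the planted outliers

K = 2;
[idx, C, ~, D] = kmeans(X, K, 'Replicates', 5);    % D holds squared distances to every centroid

n = size(X, 1);
isOut = false(n, 1);
for i = 1:K
    members = find(idx == i);              % rows assigned to cluster i
    d = sqrt(D(members, i));               % Euclidean distance to own centroid
    cutoff = mean(d) + 3 * std(d);         % assumed reading of "3 sigma"
    isOut(members(d > cutoff)) = true;     % mark possible outliers
end

outlierRows = X(isOut, :);                 % kept aside for later classification
dataMod     = X(~isOut, :);                % input for the next clustering pass
fprintf('%d rows flagged, %d rows kept\n', size(outlierRows,1), size(dataMod,1));
```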
    

    The tough part is obviously determining whether you are satisfied.
    This is the key to the algorithm’s effectiveness. The rough structure of
    this part

    if (you are satisfied)
        break
    end
    

    would be something like this

    if (situation has improved)
        data = dataMod
    
    elseif (situation is same or worse)
        dataMod = data
        break            
    end
    

    The situation has improved when there are fewer outliers, or when the variance
    explained for ALL choices of N is better than during the previous pass of the while loop. This is also something to fiddle with.
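
    The improvement test could be sketched as a small predicate over per-pass summaries. The struct fields `numOutliers` and `explained` and all numbers below are made up for illustration; `explained` is assumed to hold the variance explained for each tested N.

```matlab
% Hypothetical summaries of the previous and the current pass (made-up values).
prev = struct('numOutliers', 12, 'explained', [0.61 0.83 0.90]);
cur  = struct('numOutliers',  5, 'explained', [0.60 0.85 0.93]);

% Improved: fewer outliers, OR better explained variance for ALL tested N.
hasImproved = @(p, c) (c.numOutliers < p.numOutliers) || all(c.explained > p.explained);

if hasImproved(prev, cur)
    disp('improved: accept dataMod and continue');
else
    disp('same or worse: revert to the previous data and stop');
end
```

    Using `all(...)` makes the second criterion strict: a single N with worse explained variance fails it, so only a drop in the outlier count can still count as an improvement.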

    Anyway, I wouldn’t call this much more than a first attempt.
    If anyone sees gaps, flaws or loopholes here, please
    comment or edit.
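
    On the K-medoids idea from the question’s re-edit: newer MATLAB releases ship a `kmedoids` function in the Statistics and Machine Learning Toolbox with a calling convention close to `kmeans`, so the change to existing code can be small. The sketch below uses toy data with one extreme point (an assumption) and has not been tried on the DARPA data, so whether it reduces the noise there is untested.

```matlab
% Sketch only: two toy blobs plus one extreme point (assumptions).
rng(3);
X = [randn(150,2); bsxfun(@plus, randn(150,2), [8 8]); 40 -40];

K = 2;
% midx gives the row indices of the medoids within X
[idxMed, Cmed, ~, ~, midx] = kmedoids(X, K, 'Distance', 'euclidean', 'Replicates', 5);

% Unlike k-means centroids, medoids are actual observations from the data.
medoidsAreRows = isequal(Cmed, X(midx, :));
fprintf('medoids are rows of X: %d\n', medoidsAreRows);
```

    Because each cluster centre is constrained to be a real data point, a single extreme value cannot drag it away from the bulk of the cluster the way it can drag a mean.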
