Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6698059
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 26, 20262026-05-26T06:31:59+00:00 2026-05-26T06:31:59+00:00

Ok I will run down what im trying to achieve and how I tryed

  • 0

Ok I will run down what im trying to achieve and how I tryed to achieve it then I will explain why I tryed this method.

I have data from the KDD cup 1999 in its original format the data has 494k of rows with 42 columns.

My goal is trying to cluster this data unsupervised. From a previous question here:

clustering and matlab

I recieved this feedback:

For starters, you need to normalize the attributes to be of the same
scale: when computing the euclidean distance as part of step 3 in your
method, the features with values such as 239 and 486 will dominate
over the other features with small values as 0.05, thus disrupting the
result.

Another point to remember is that too many attributes can be a bad
thing (curse of dimensionality). Thus you should look into feature
selection or dimensionality reduction techniques.

So the first thing I went about doing was addressing the feature selection which is related to this article: http://narensportal.com/papers/datamining-classification-algorithm.aspx#_sec-2-1

and looks like this after selecting the necessary features:

enter image description here

So for the clustering I removed the discrete values which left me with 3 columns with numeric data, I then went about removing the duplicate rows see: junk, index and unique on a matrix (how to keep matrix format) in the file which reduced the 3 columns from 494k to 67k which was done like so:

[M,ind] = unique(data, 'rows', 'first');
[~,ind] = sort(ind);
M = M(ind,:);

I then used the random permutation to reduce the file size from 67k to 1000 like so:

m = 1000;
n = 3;

%# pick random rows
indX = randperm( size(M,1) );
indX = indX(1:m);

%# pick random columns
indY = randperm( size(M,2) );
indY = indY(1:n);

%# filter data
data = M(indX,indY)

So now I have a file with 3 of my features which I selected I have removed duplicate records and used the random permutation to further reduce the dataset my last goal was to normalize this data and I did this with:

normalized_data = data/norm(data);

I then used the following K-means script:

%% generate clusters
K = 4;

%% cluster
opts = statset('MaxIter', 500, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);

%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 50, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 200, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);

%% Assign data to clusters
% calculate distance (squared) of all instances to each cluster centroid
D = zeros(numObservarations, K);     % init distances
for k=1:K
%d = sum((x-y).^2).^0.5
D(:,k) = sum( ((data - repmat(clusters(k,:),numObservarations,1)).^2), 2);
end

% find  for all instances the cluster closet to it
[minDists, clusterIndices] = min(D, [], 2);
% compare it with what you expect it to be
sum(clusterIndices == clustIDX)

But my results are still coming out like my original question I asked here: clustering and matlab

Here is what the data looks like when plotted:

enter image description here

and:

enter image description here

Can anyone help solve this problem, are the methods im using not the correct methods or is there something im missing?

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-26T06:31:59+00:00Added an answer on May 26, 2026 at 6:31 am

    Just like to say thanks to cyborg and Amro for helping, I realized that rather than create my own pre-processing I kept the dimensions as such and I finally managed to get some clustered data!

    Out put!

    enter image description here

    Ofcourse I still have some outliers but if I could get rid of them and plot the graph from -0.2 – 0.2 im sure it would look alot better. But if you look at the original attempt I seem to be getting there!

      %% load data
        %# read the list of features
        fid = fopen('kddcup.names','rt');
        C = textscan(fid, '%s %s', 'Delimiter',':', 'HeaderLines',1);
        fclose(fid);
    
        %# determine type of features
        C{2} = regexprep(C{2}, '.$','');              %# remove "." at the end
        attribNom = [ismember(C{2},'symbolic');true]; %# nominal features
    
        %# build format string used to read/parse the actual data
        frmt = cell(1,numel(C{1}));
        frmt( ismember(C{2},'continuous') ) = {'%f'}; %# numeric features: read as number
        frmt( ismember(C{2},'symbolic') ) = {'%s'};   %# nominal features: read as string
        frmt = [frmt{:}];
        frmt = [frmt '%s'];                           %# add the class attribute
    
        %# read dataset
        fid = fopen('kddcup.data_10_percent_corrected','rt');
        C = textscan(fid, frmt, 'Delimiter',',');
        fclose(fid);
    
        %# convert nominal attributes to numeric
        ind = find(attribNom);
        G = cell(numel(ind),1);
        for i=1:numel(ind)
            [C{ind(i)},G{i}] = grp2idx( C{ind(i)} );
        end
    
        %# all numeric dataset
        fulldata = cell2mat(C);
        %% dimensionality reduction 
        columns = 42
        [U,S,V]=svds(fulldata,columns)
        %% randomly select dataset
        rows = 5000;
        %# pick random rows
        indX = randperm( size(fulldata,1) );
        indX = indX(1:rows);
        %# pick random columns
        indY = randperm( size(fulldata,2) );
        indY = indY(1:columns);
        %# filter data
        data = U(indX,indY)
        %% apply normalization method to every cell
        data = data./repmat(sqrt(sum(data.^2)),size(data,1),1)
        %% generate sample data
        K = 4;
        numObservarations = 5000;
        dimensions = 42;
        %% cluster
        opts = statset('MaxIter', 500, 'Display', 'iter');
        [clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
        'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);
        %% plot data+clusters
        figure, hold on
        scatter3(data(:,1),data(:,2),data(:,3), 5, clustIDX, 'filled')
        scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 100, (1:K)', 'filled')
        hold off, xlabel('x'), ylabel('y'), zlabel('z')
        %% plot clusters quality
        figure
        [silh,h] = silhouette(data, clustIDX);
        avrgScore = mean(silh);
        %% Assign data to clusters
        % calculate distance (squared) of all instances to each cluster centroid
        D = zeros(numObservarations, K);     % init distances
        for k=1:K
        %d = sum((x-y).^2).^0.5
        D(:,k) = sum( ((data - repmat(clusters(k,:),numObservarations,1)).^2), 2);
        end
        % find  for all instances the cluster closet to it
        [minDists, clusterIndices] = min(D, [], 2);
        % compare it with what you expect it to be
        sum(clusterIndices == clustIDX)
    
    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have c# that will run in a windows service. I'm trying to use
I have an issue with a query I'm trying to run on some data,
This is what I'm trying to achieve. If All is selected from the drop
The app will run fine, then crash - literally every other time. It seems
If you have a web application that will run inside a network, it makes
I have a button that when clicked will run a stored procedure on a
I'm trying to write a Sinatra app that will run on a shared Passenger
I am trying to get a sql query to run on drop down menu
I'm trying to write down a script to fetch some online data; script should
I'm trying to run a very simple integration test and keep getting this error:

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.