I am doing unsupervised classification. For this I have 8 features (Variance of Green, Std. div. of Green , Mean of Red, Variance of Red, Std. div. of Red, Mean of Hue, Variance of Hue, Std. div. of Hue) for classification per image and I want to select 3 most significant features using PCA. I have written the following code for feature selection
(where dimension of feature is : 179X8) :
for c=1:size(feature,1)
feature(c,:)=feature(c,:)-mean(feature)
end
DataCov=cov(feature); % covariance matrix
[PC,variance,explained] = pcacov(DataCov)
This gives me :
PC =
0.0038 -0.0114 0.0517 0.0593 0.0039 0.3998 0.9085 -0.0922
0.0755 -0.1275 0.6339 0.6824 -0.3241 -0.0377 -0.0641 0.0052
0.7008 0.7113 -0.0040 0.0496 -0.0207 0.0042 0.0012 0.0002
0.0007 -0.0012 0.0051 0.0101 0.0272 0.0288 0.0873 0.9953
0.0320 -0.0236 0.1521 0.2947 0.9416 -0.0142 -0.0289 -0.0266
0.7065 -0.6907 -0.1282 -0.0851 0.0060 0.0003 0.0010 -0.0001
0.0026 -0.0037 0.0632 -0.0446 0.0053 0.9125 -0.4015 0.0088
0.0543 -0.0006 0.7429 -0.6574 0.0838 -0.0705 0.0311 -0.0001
variance =
0.0179
0.0008
0.0001
0.0000
0.0000
0.0000
0.0000
0.0000
explained =
94.9471
4.1346
0.6616
0.2358
0.0204
0.0003
0.0002
0.0000
This means first principle component has 94.9% variance explained and so on … but these are in order of most to least significant.
How can I know which features (from 1 to 8) to be selected based on above information.
Your problem is the same as the COLUMNSELECT problem discussed by Mahoney and Drineas in “CUR matrix decompositions for improved data analysis”.
They first compute the leverage scores for each dimension and then selects 3 of them randomly using the leverage scores as weights. Alternatively, you can select the largest ones. Here’s the script for your problem:
I first got a real nature image from the web and resized it to the dimensions you ask. The image is as follows:
Then, compute the centralized data:
I use SVD to compute PCA since it gives you both the principal components and the coefficients. If the samples are in columns, then
Uholds the principal components. Check the second page of this paper for the relationship.The key idea here is that we want to get the dimensions having most of the variation. And an assumption is that there’s some noise in data. We select only the dominant eigenvectors, e.g. representing the 95% of the data.
Then the leverage scores are computed using
nEigof the principal components. That is, we take the norm of thenEigcoefficients.Then, we can sort the leverage scores:
and get the indices of the vectors with the largest leverage scores:
You can check the paper for more details.
But, keep in mind that PCA-based technique is good if you have many many dimensions. In your case, the search space is very small. My advice is to search exhaustively in the space and get the best selection as @amit recommends.