I am using the sparcl package written by Witten and Tibshirani based on their paper:
Witten DM and R Tibshirani (2010) A framework for feature selection in clustering. Journal of the American Statistical Association 105(490): 713-726
I am looking at the example given for the function HierarchicalSparseCluster:
library(sparcl)
# Generate 2-class data
set.seed(1)
x <- matrix(rnorm(100*50),ncol=50)
y <- c(rep(1,50),rep(2,50))
x[y==1,1:25] <- x[y==1,1:25]+2
# Do tuning parameter selection for sparse hierarchical clustering
perm.out <- HierarchicalSparseCluster.permute(x, wbounds=c(1.5,2:6),nperms=5)
# Perform sparse hierarchical clustering
sparsehc <- HierarchicalSparseCluster(dists=perm.out$dists, wbound=perm.out$bestw, method="complete")
When I check dim(sparsehc$dists), it returns 4950 and 50. From the simulation setup, we know that n=100 and p=50. According to the manual, the returned value dists is an (n*n)xp dissimilarity matrix for the data matrix x. But the row dimension is clearly not n*n, which would be 100*100=10000 rather than 4950. Did I misunderstand something? Thank you very much!
It seems to be a mistake in the sparcl help page: the dimensions of the dissimilarity matrix dist are n2 x p, where n2=n*(n-1)/2. Indeed, we don't need the full n x n matrix of distances, only the part of this matrix above the main diagonal. For n=100 this gives n2=100*99/2=4950, which matches the row count you observed.
The sources of sparcl support what I said above:
In distfun.R we can see how n2 is calculated and passed to the Fortran function.
In distfun.f, for each feature a column of size n2 is constructed in the dist matrix, holding the sequence of pairwise distances between objects. For example, for n=4, p=2 and n2=4*3/2=6, the final matrix will be 6x2, with one row per pair of objects and one column per feature, where, say, d(2,4)_1 is the distance between the 2nd and 4th objects for the 1st feature.
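The layout described above can be reproduced in base R as a rough sketch. The helper pairwise_feature_dists is hypothetical and not part of sparcl; the use of absolute differences and the row order (pairs i < j in the order produced by combn) are assumptions made for illustration and may differ from what distfun.f actually does.

```r
# Hypothetical helper (not from sparcl): for every unordered pair of rows
# (i, j) with i < j, compute the per-feature absolute difference.
# The result has n*(n-1)/2 rows and p columns.
pairwise_feature_dists <- function(x) {
  n <- nrow(x)
  pairs <- t(combn(n, 2))  # one row per pair: (1,2), (1,3), ..., (n-1,n)
  d <- abs(x[pairs[, 1], , drop = FALSE] - x[pairs[, 2], , drop = FALSE])
  rownames(d) <- paste0("d(", pairs[, 1], ",", pairs[, 2], ")")
  d
}

# Same dimensions as the data in the question: n = 100, p = 50
set.seed(1)
x <- matrix(rnorm(100 * 50), ncol = 50)
d <- pairwise_feature_dists(x)
dim(d)          # 4950 50
choose(100, 2)  # 4950, i.e. 100 * 99 / 2

# The small case from the answer: n = 4, p = 2, n2 = 6
small <- pairwise_feature_dists(matrix(1:8, nrow = 4))
dim(small)      # 6 2
small
```

In the small case, row 5 is labelled d(2,4) and its first column holds the distance between the 2nd and 4th objects for the 1st feature.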