I’m using dendrogram from scipy to plot hierarchical clustering using matplotlib as follows:
import matplotlib.pyplot as plt
from numpy import array
from scipy.cluster.hierarchy import dendrogram, linkage

mat = array([[1, 0.5, 0.9],
             [0.5, 1, -0.5],
             [0.9, -0.5, 1]])

# Left panel: cluster mat as given
plt.subplot(1, 2, 1)
plt.title("mat")
dist_mat = mat
linkage_matrix = linkage(dist_mat, "single")
print("linkage2:")
print(linkage(1 - dist_mat, "single"))
dendrogram(linkage_matrix,
           color_threshold=1,
           labels=["a", "b", "c"],
           show_leaf_counts=True)

# Right panel: cluster 1 - mat
plt.subplot(1, 2, 2)
plt.title("1 - mat")
dist_mat = 1 - mat
linkage_matrix = linkage(dist_mat, "single")
dendrogram(linkage_matrix,
           color_threshold=1,
           labels=["a", "b", "c"],
           show_leaf_counts=True)
plt.show()
My questions are: first, why do mat and 1-mat give identical clusterings here? Second, how can I annotate the distance along each branch of the tree using dendrogram so that the distances between pairs of nodes can be compared?
Finally, it seems that the show_leaf_counts flag is ignored. Is there a way to turn it on so that the number of objects in each class is shown? Thanks.
The input to linkage() is either an n x m array, representing n points in m-dimensional space, or a one-dimensional array containing the condensed distance matrix. In your example, mat is 3 x 3, so you are clustering three 3-d points. Clustering is based on the distance between these points.
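If mat is actually meant to be a similarity (e.g. correlation) matrix rather than a set of points, one option is to convert it to a condensed distance matrix before calling linkage(). The sketch below assumes that 1 - mat is an acceptable distance (that is my assumption, not something stated in the question) and uses scipy.spatial.distance.squareform for the conversion:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

mat = np.array([[1, 0.5, 0.9],
                [0.5, 1, -0.5],
                [0.9, -0.5, 1]])

# Assumption: mat holds similarities, so 1 - mat is treated as a distance.
# squareform needs a symmetric matrix with a zero diagonal.
square_dist = 1 - mat
condensed = squareform(square_dist)  # 1-d array ordered (a,b), (a,c), (b,c)

# A 1-d input is interpreted as a condensed distance matrix, not as points.
Z = linkage(condensed, "single")
print(condensed)
print(Z)
```

With this input the merge heights in the third column of Z come from the entries of 1 - mat themselves, not from Euclidean distances between the rows.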
Why do mat and 1-mat give identical clusterings here?
The arrays mat and 1-mat produce the same clustering because the clustering is based on distances between the points, and neither a reflection (-mat) nor a translation (mat + offset) of the entire data set changes the relative distances between the points.
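A quick way to check this is to compute the pairwise distances yourself with scipy.spatial.distance.pdist, which uses the same default Euclidean metric that linkage() applies to a 2-d input. This is only a small sketch using the mat from the question:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

mat = np.array([[1, 0.5, 0.9],
                [0.5, 1, -0.5],
                [0.9, -0.5, 1]])

# Euclidean distances between the rows, in the order (a,b), (a,c), (b,c).
d_mat = pdist(mat)
d_flipped = pdist(1 - mat)

# Reflecting and shifting every point leaves the pairwise distances unchanged.
same_distances = np.allclose(d_mat, d_flipped)

# Identical distances give identical linkage matrices.
Z_mat = linkage(d_mat, "single")
Z_flipped = linkage(d_flipped, "single")
same_linkage = np.allclose(Z_mat, Z_flipped)

print(np.round(d_mat, 2))  # [1.57 1.01 2.16]
print(same_distances, same_linkage)
```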
How can I annotate the distance along each branch of the tree using dendrogram so that the distances between pairs of nodes can be compared?
You can use the data returned by dendrogram to label the horizontal segments of the diagram with the corresponding distance. The values associated with the keys icoord and dcoord give the x and y coordinates of each three-segment inverted-U of the figure. In augmented_dendrogram this data is used to add a label of the distance (i.e. the y value) of each horizontal line segment in the dendrogram.
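A minimal sketch of what such an augmented_dendrogram wrapper could look like is given here. The marker style, the label format and the text offset are arbitrary choices of mine, and the Agg backend line is only there so the snippet runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage


def augmented_dendrogram(*args, **kwargs):
    """Draw a dendrogram and label each horizontal segment with its height."""
    ddata = dendrogram(*args, **kwargs)
    if not kwargs.get("no_plot", False):
        ax = kwargs.get("ax") or plt.gca()
        for xs, ys in zip(ddata["icoord"], ddata["dcoord"]):
            # Each link is an inverted U: xs[1], xs[2] are the ends of the
            # horizontal bar and ys[1] is its height (the merge distance).
            x = 0.5 * (xs[1] + xs[2])
            y = ys[1]
            ax.plot(x, y, "ro")
            ax.annotate("%.3g" % y, (x, y), xytext=(0, -8),
                        textcoords="offset points",
                        va="top", ha="center")
    return ddata


mat = np.array([[1, 0.5, 0.9],
                [0.5, 1, -0.5],
                [0.9, -0.5, 1]])
linkage_matrix = linkage(mat, "single")

plt.figure()
ddata = augmented_dendrogram(linkage_matrix,
                             color_threshold=1,
                             labels=["a", "b", "c"])
heights = sorted(d[1] for d in ddata["dcoord"])
print(heights)
```

Call plt.show() or plt.savefig(...) afterwards to view the figure.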
For your mat array, the augmented dendrogram shows that points ‘a’ and ‘c’ are 1.01 units apart, and point ‘b’ is 1.57 units from the cluster [‘a’, ‘c’].
It seems that show_leaf_counts flag is ignored, is there a way to turn it on so that the number of objects in each class is shown?
The flag show_leaf_counts only applies when not all the original data points are shown as leaves. For example, when truncate_mode="lastp", only the last p nodes are shown.
Consider an example with 100 points. With p=6 and truncate_mode="lastp", dendrogram only shows the “top” of the dendrogram. When show_leaf_counts=True, each leaf that stands for a contracted cluster is labeled with the number of original points it contains, in parentheses; when show_leaf_counts=False, those leaves are left unlabeled.
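A minimal sketch of such a 100-point example follows. The data set is made up (random 2-d points from a seeded generator), since the original points are not given, and the Agg backend line is only there so it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical data: 100 random 2-d points (not the original data set).
rng = np.random.default_rng(0)
points = rng.normal(size=(100, 2))
Z = linkage(points, "single")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Left: contracted leaves are labeled with their point counts, e.g. "(23)".
with_counts = dendrogram(Z, p=6, truncate_mode="lastp",
                         show_leaf_counts=True, ax=axes[0])
axes[0].set_title("show_leaf_counts=True")

# Right: the same truncated tree, but contracted leaves get empty labels.
without_counts = dendrogram(Z, p=6, truncate_mode="lastp",
                            show_leaf_counts=False, ax=axes[1])
axes[1].set_title("show_leaf_counts=False")

print(with_counts["ivl"])
print(without_counts["ivl"])
```

The "ivl" entry of the returned dictionary holds the leaf labels, so you can inspect the counts without looking at the plot. A leaf that is still a single original point keeps its plain index as its label in both cases.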