Can anyone point me to a hierarchical clustering tool (preferably in Python) that can cluster ~1 million objects? I have tried hcluster and also Orange.
hcluster had trouble with 18k objects. Orange was able to cluster 18k objects in seconds, but failed with 100k objects (saturated memory and eventually crashed).
I am running on a 64-bit Xeon CPU (2.53 GHz) with 8 GB of RAM + 3 GB swap on Ubuntu 11.10.
To beat O(n^2), you’ll have to first reduce your 1M points (documents)
to e.g. 1000 piles of 1000 points each, or 100 piles of 10k each, or …
Two possible approaches:
1. Build a hierarchical tree from, say, 15k points, then add the rest one by one:
   time ~ 1M * treedepth.
2. First build 100 or 1000 flat clusters,
   then build your hierarchical tree of the 100 or 1000 cluster centres.
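The second approach (flat clusters first, then a tree over the centres) can be sketched roughly as below. This is only a minimal sketch under assumptions: it uses SciPy's `kmeans2` and `linkage` on dense Euclidean vectors, and the function names `two_stage_hierarchy` and `top_level_labels`, the `n_flat` parameter and the choice of average linkage are all made up for illustration. Sparse text vectors would need a different flat clusterer and distance.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.cluster.vq import kmeans2


def two_stage_hierarchy(points, n_flat=100, seed=0):
    """Flat k-means first, then a hierarchical tree over the centres only.

    Hypothetical helper: assumes `points` is a dense (n, d) float array.
    """
    rng = np.random.default_rng(seed)
    # Start k-means from n_flat randomly chosen data points.
    init = points[rng.choice(len(points), size=n_flat, replace=False)]
    # Stage 1: flat clustering, cost roughly n * n_flat per iteration.
    centres, labels = kmeans2(points, init, minit="matrix")
    # Drop centres that ended up empty and renumber the labels 0..m-1.
    used = np.unique(labels)
    centres = centres[used]
    labels = np.searchsorted(used, labels)
    # Stage 2: agglomerative tree over m centres, cost ~ m^2 instead of n^2.
    tree = linkage(centres, method="average")
    return centres, labels, tree


def top_level_labels(tree, labels, n_top):
    """Cut the centre tree into at most n_top groups and map points to them."""
    centre_groups = fcluster(tree, t=n_top, criterion="maxclust")
    return centre_groups[labels]
```

The tree only ever sees the centres, so memory stays small; each original point inherits its position in the tree from the flat cluster it belongs to.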
How well either of these might work depends critically
on the size and shape of your target tree:
how many levels, how many leaves?
What software are you using,
and how many hours / days do you have to do the clustering?
For the flat-cluster approach,
k-d trees
work fine for points in 2d, 3d, 20d, even 128d, which is not your case.
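For dense points in modest dimensions, assigning every point to its nearest cluster centre with a k-d tree might look like the sketch below. It assumes SciPy's `cKDTree` and Euclidean distance; `assign_to_centres` is a hypothetical helper name, and this is not meant for sparse, very high-dimensional text vectors.

```python
import numpy as np
from scipy.spatial import cKDTree


def assign_to_centres(points, centres):
    """Return, for each point, the index of its nearest centre (Euclidean)."""
    tree = cKDTree(centres)            # build the k-d tree over the centres once
    _, nearest = tree.query(points, k=1)  # one nearest-neighbour lookup per point
    return nearest


centres = np.array([[0.0, 0.0], [10.0, 10.0]])
points = np.array([[1.0, 1.0], [9.0, 9.0], [0.0, -1.0]])
nearest = assign_to_centres(points, centres)
```

Building the tree over the (few) centres rather than over all the points keeps the structure small, and the queries can be run in chunks if the full point set does not fit in memory.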
I know hardly anything about clustering text;
would locality-sensitive hashing help?
Take a look at scikit-learn clustering —
it has several methods, including DBSCAN.
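As a rough idea of what the scikit-learn interface looks like, here is a small DBSCAN sketch on toy 2d data. The data, `eps` and `min_samples` values are arbitrary assumptions for illustration, and whether DBSCAN scales to 1M documents depends on the data and parameters, so treat this as a starting point for experiments on a sample.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (made up): two tight, well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# eps is the neighbourhood radius, min_samples the density threshold.
model = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = model.labels_                      # label -1 marks noise points
n_clusters = len(set(labels.tolist()) - {-1})
```

Note that DBSCAN gives flat clusters, not a tree, so it would play the role of the first (flat) stage rather than replace the hierarchical step.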
Added: see also
google-all-pairs-similarity-search
“Algorithms for finding all similar pairs of vectors in sparse vector data”, Bayardo et al. 2007
SO hierarchical-clusterization-heuristics