While most questions are about grouping nodes based on similarity (pigeonholing), I would like to group nodes based simply on their proximity.
I have a large, dense collection of nodes, potentially millions. On-screen they take up some amount of space, so they can be thought of as having a size.
What I am trying to do is group these nodes into single containing nodes efficiently, both in terms of processing time and in terms of how many nodes each container collects.
My attempts so far have either been too slow or didn't work, but they are all based on the same idea: calculate a lot of possible containers by taking a node and its surrounding nodes at random and grouping them, then pick the most effective container.
What are your ideas? They don't need to be specific to any language, but I will be using PHP or JavaScript for this.
Edit
I forgot to mention that the nodes will be streamed in, so the solution needs to accept an unlimited number of nodes, putting them into containers as they arrive and creating new containers or even deleting them as necessary, for up to millions of containers. That would be ideal.
This problem is called clustering. You have a set of nodes and a function m that calculates the distance between any two nodes. You now search for clusters so that the sum of all the distances between all nodes inside each cluster is minimal.
There are some easy algorithms to do this. Search for k-Means and k-Medoid, for example. These two are very similar to your approach. A more efficient version is the CLARANS algorithm [NH94]. I didn't find any good sources for you, but here you go:
(German) Script on clustering in general. Contains CLARANS in pseudo-code on page 45
http://www.informatik.hu-berlin.de/forschung/gebiete/wbi/teaching/archive/ws1112/vl_datawarehousing/15_clustering_12.pdf
English script that explains CLARANS
http://bib.dbvis.de/uploadedFiles/232.pdf
Paper about CLARANS
http://www.comp.nus.edu.sg/~atung/publication/pakdd002.pdf
The “k” in the names is the number of clusters. For those 3 algorithms you have to specify the number of clusters a priori.
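To make the role of "k" concrete, here is a minimal k-Means sketch in JavaScript. It is only an illustration under some assumptions of mine: nodes are plain objects with x and y, the distance function m is Euclidean, and the first k nodes are used as starting centroids (real implementations usually pick them more carefully). The names distance and kMeans are made up for this example.

```javascript
// Euclidean distance between two 2D nodes; stands in for the distance function m.
function distance(a, b) {
  return Math.hypot(a.x - b.x, a.y - b.y);
}

// Minimal k-Means (Lloyd's algorithm) sketch.
// Assumptions: points.length >= k, and the first k points seed the centroids.
function kMeans(points, k, maxIterations = 100) {
  let centroids = points.slice(0, k).map(p => ({ x: p.x, y: p.y }));
  const assignments = new Array(points.length).fill(-1);

  for (let iter = 0; iter < maxIterations; iter++) {
    let changed = false;

    // Assignment step: attach every node to its nearest centroid.
    points.forEach((p, i) => {
      let best = 0;
      for (let c = 1; c < k; c++) {
        if (distance(p, centroids[c]) < distance(p, centroids[best])) best = c;
      }
      if (assignments[i] !== best) {
        assignments[i] = best;
        changed = true;
      }
    });

    // Stop once no node switched clusters.
    if (!changed) break;

    // Update step: move each centroid to the mean of its assigned nodes.
    const sums = centroids.map(() => ({ x: 0, y: 0, n: 0 }));
    points.forEach((p, i) => {
      const s = sums[assignments[i]];
      s.x += p.x;
      s.y += p.y;
      s.n += 1;
    });
    // A centroid that lost all its nodes keeps its previous position.
    centroids = sums.map((s, c) =>
      s.n > 0 ? { x: s.x / s.n, y: s.y / s.n } : centroids[c]
    );
  }

  return { centroids, assignments };
}
```

Calling kMeans(nodes, 2) returns the two centroids plus, for every node, the index of the container it ended up in. Note that this version needs all nodes up front, so it does not by itself cover the streaming case from the edit.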
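For a different approach, see the DBSCAN algorithm. You won't need the number of clusters for this algorithm, but you have to provide some other knowledge of your nodes: a neighborhood radius and the minimum number of nodes that must lie within that radius to count as a dense region. The Wikipedia article explains this very well. 🙂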
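A rough DBSCAN sketch in JavaScript, to show what that looks like in practice. Treat it as an illustration, not a tuned implementation: the function name dbscan and the parameter names eps and minPts are my own choices, nodes are assumed to be objects with x and y, and the neighbor search is brute force (quadratic), which would need a spatial index such as a grid to scale to millions of nodes.

```javascript
// Minimal DBSCAN sketch with a brute-force neighbour search.
// eps: neighbourhood radius; minPts: neighbours (including the node itself)
// required for a node to count as a core point. Both names are illustrative.
function dbscan(points, eps, minPts) {
  const NOISE = -1;
  const labels = new Array(points.length).fill(undefined);
  let cluster = -1;

  // Indices of all nodes within eps of node i (includes i itself).
  const neighbours = (i) => {
    const result = [];
    for (let j = 0; j < points.length; j++) {
      const d = Math.hypot(points[i].x - points[j].x, points[i].y - points[j].y);
      if (d <= eps) result.push(j);
    }
    return result;
  };

  for (let i = 0; i < points.length; i++) {
    if (labels[i] !== undefined) continue; // already visited

    const seeds = neighbours(i);
    if (seeds.length < minPts) {
      labels[i] = NOISE; // may later be claimed as a border point
      continue;
    }

    // Node i is a core point: start a new cluster and grow it.
    cluster++;
    labels[i] = cluster;
    const queue = seeds.filter(j => j !== i);

    while (queue.length > 0) {
      const j = queue.pop();
      if (labels[j] === NOISE) labels[j] = cluster; // border point joins the cluster
      if (labels[j] !== undefined) continue;        // already assigned
      labels[j] = cluster;

      // Only core points extend the cluster further.
      const reachable = neighbours(j);
      if (reachable.length >= minPts) {
        for (const n of reachable) queue.push(n);
      }
    }
  }

  return labels; // cluster index per node, or -1 for noise
}
```

The result has one entry per node: the index of its cluster, or -1 if the node is too isolated to belong to any container. How many clusters come out depends only on eps and minPts, not on a number chosen in advance.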