I have a clustering problem that could be summarized this way:
- I have N particles in a 3D space
- each particle can interact with a different number of other particles
- each interaction has a strength
- I don’t know the number of clusters a priori
- I don’t have learning samples (it should be unsupervised)
Output: I’d like to get:
- the number of clusters
- a probability for each particle to be part of a cluster (so that I can remove particles that are not clearly assigned)
- I want to call the clusterer directly from my Java code.
Question:
- Which clusterer would best fit my problem?
- How should I format my data?
- Should I use the 3D position information in addition to the interaction information?
- How can I get the result for each particle?
I’m very new to Weka, but from what I could find on the Internet:
- SOM could solve my problem
- It is a multi-instance problem, but I couldn’t find any examples showing how to create relational data. And does SOM support relational attributes?
Thanks for your help.
jeannot
Weka is very “limited” when it comes to clustering. It has only a few clustering algorithms, and they are not very flexible. I’m not sure you could feed the interaction strength into any of the Weka clustering algorithms.
You might want to have a look at ELKI. It has much more advanced clustering algorithms than Weka, and they are very flexible. For example, you can easily define your own distance function (see the ELKI tutorial) and use it in any distance-based clustering algorithm.
Choosing the appropriate clustering algorithm is not something we can answer for you here. You need to try some algorithms and try different parameters. The key question you should try to answer first is: what is a useful cluster for you?
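To make the "own distance function" idea concrete, here is a minimal sketch in plain Java. It is not ELKI's API (take the actual interface from the ELKI tutorial); the class `InteractionDistance` and the mapping `distance = 1 / strength` are my assumptions. You may well need a different mapping for your data.

```java
/** Hypothetical helper: turns pairwise interaction strengths into distances. */
final class InteractionDistance {

    /**
     * strength[i][j] > 0 means particles i and j interact, 0 means they do not.
     * Assumed mapping: distance = 1 / strength, and an infinite distance
     * when there is no interaction at all.
     */
    static double[][] toDistances(double[][] strength) {
        int n = strength.length;
        double[][] dist = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (i == j) {
                    dist[i][j] = 0.0; // a particle is at distance 0 from itself
                } else {
                    // Symmetrize in case only one direction was filled in
                    double s = Math.max(strength[i][j], strength[j][i]);
                    dist[i][j] = s > 0 ? 1.0 / s : Double.POSITIVE_INFINITY;
                }
            }
        }
        return dist;
    }
}
```

Strong interactions become small distances, so a distance-based algorithm will tend to group strongly interacting particles.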
You have started to pose some of these questions yourself, for example whether to use interaction strength only or to also include positional information. But as I do not know what you want to achieve, I can’t tell you which is right.
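If you do want to try both sources of information, one possible approach (an assumption on my side, not a recommendation) is a weighted blend of a normalized spatial distance and a normalized interaction dissimilarity. The names `CombinedDistance`, `alpha`, `maxStrength` and `maxSpatial` are hypothetical:

```java
/** Hypothetical sketch: blends 3D position and interaction strength into one distance. */
final class CombinedDistance {

    /** Plain Euclidean distance between two points of equal dimension. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int k = 0; k < a.length; k++) {
            double diff = a[k] - b[k];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /**
     * alpha = 1 uses only the interaction strength, alpha = 0 only the position.
     * maxStrength and maxSpatial are assumed normalization constants (for example
     * the largest strength and largest pairwise distance in the data set).
     */
    static double combined(double[] posA, double[] posB, double strength,
                           double maxStrength, double maxSpatial, double alpha) {
        double interaction = 1.0 - strength / maxStrength;   // 0 = strongest interaction
        double spatial = euclidean(posA, posB) / maxSpatial; // 0 = same position
        return alpha * interaction + (1.0 - alpha) * spatial;
    }
}
```

Running the clustering with a few values of `alpha` is a cheap way to see whether the positions add anything useful.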
Definitely have a look at the DBSCAN and OPTICS algorithms (for OPTICS in particular, don’t use the one in Weka; it is slow, incomplete and unmaintained!). Start by reading their Wikipedia articles to see whether they make any sense for your task. Here is why I believe they are helpful for you:
Next I would probably use the interaction-strength data with OPTICS and try the Xi extraction of clusters, to see if the clusters make any sense for your use case (Weka doesn’t have the Xi extraction). Or maybe look at the OPTICS plot first, to see if your similarity and MinPts parameter actually produce the “valleys” you need for OPTICS.
DBSCAN is faster, but you need to fix the distance threshold. If your data set is very large, you might want to start with OPTICS on a sample, then decide on a few epsilon-values and run DBSCAN on the full dataset with these values.
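To give an idea of what DBSCAN does with a fixed distance threshold, here is a minimal, unoptimized sketch in plain Java that works on a precomputed distance matrix. It is a textbook-style illustration, not ELKI's or Weka's implementation; the class `SimpleDbscan` and its method names are made up for this example. Note that it returns hard labels, not probabilities.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical minimal DBSCAN on a precomputed distance matrix (O(n^2)). */
final class SimpleDbscan {
    static final int NOISE = -1;
    private static final int UNVISITED = -2;

    /** Returns one label per point: 0, 1, 2, ... for clusters, NOISE for outliers. */
    static int[] cluster(double[][] dist, double eps, int minPts) {
        int n = dist.length;
        int[] label = new int[n];
        Arrays.fill(label, UNVISITED);
        int clusterId = 0;
        for (int p = 0; p < n; p++) {
            if (label[p] != UNVISITED) continue;
            List<Integer> neighbors = regionQuery(dist, p, eps);
            if (neighbors.size() < minPts) { // not a core point (for now)
                label[p] = NOISE;
                continue;
            }
            label[p] = clusterId;            // p is a core point: start a new cluster
            Deque<Integer> queue = new ArrayDeque<>(neighbors);
            while (!queue.isEmpty()) {
                int q = queue.poll();
                if (label[q] == NOISE) label[q] = clusterId; // noise becomes a border point
                if (label[q] != UNVISITED) continue;         // already assigned
                label[q] = clusterId;
                List<Integer> qNeighbors = regionQuery(dist, q, eps);
                if (qNeighbors.size() >= minPts) queue.addAll(qNeighbors); // q is core: expand
            }
            clusterId++;
        }
        return label;
    }

    /** All points within eps of p, including p itself. */
    static List<Integer> regionQuery(double[][] dist, int p, double eps) {
        List<Integer> result = new ArrayList<>();
        for (int q = 0; q < dist.length; q++) {
            if (dist[p][q] <= eps) result.add(q);
        }
        return result;
    }

    /** Number of clusters found, ignoring noise. */
    static int countClusters(int[] labels) {
        int max = -1;
        for (int l : labels) max = Math.max(max, l);
        return max + 1;
    }
}
```

The number of clusters falls out of the result, and particles labeled `NOISE` are the ones that do not clearly belong anywhere, which may be enough for filtering them out.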
Still, start reading here to see if that makes sense for your task:
https://en.wikipedia.org/wiki/DBSCAN#Basic_idea