I have a site that has multiple Project objects. Each project has (for example):
- multiple tags
- multiple categories
- a size
- multiple types
- etc.
I would like to write a method to grab all ‘similar’ projects based on the above criteria. I can easily retrieve similar projects for each of the above individually (i.e. projects of a similar size, or projects that share a category, etc.), but I would like it to be more intelligent than just choosing projects that have all of the above in common, or projects that have at least one of the above in common.
Ideally, I would like to weight each of the criteria, e.g. a project that has a tag in common is less ‘similar’ than a project that is close in size. Likewise, a project that has two tags in common is more similar than a project that has one tag in common.
What approach (practically and mathematically) can I take to do this?
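The common way to handle this (in machine learning at least) is to define a metric that measures similarity. A Jaccard metric seems like a good match here, given that you have types, categories, tags, etc., which are sets of labels rather than numbers. The Jaccard similarity of two sets is the size of their intersection divided by the size of their union, so the more tags two projects share, relative to all the tags they have between them, the higher the score.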
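As a rough sketch of how the weighting asked about in the question could sit on top of Jaccard similarity: compute one score per criterion and take a weighted average. The `Project` class, the field names, the weight values and the size formula below are all assumptions made for illustration, not anything prescribed by the Jaccard index itself; tune the weights to taste.

```python
from dataclasses import dataclass, field


@dataclass
class Project:
    # Hypothetical model; field names are assumed from the question's list.
    name: str
    size: float
    tags: set = field(default_factory=set)
    categories: set = field(default_factory=set)
    types: set = field(default_factory=set)


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def size_similarity(x, y):
    """1.0 for equal sizes, dropping towards 0.0 as the relative gap grows.

    Assumes non-negative sizes; the result is clamped to [0, 1] regardless.
    """
    largest = max(abs(x), abs(y))
    if largest == 0:
        return 1.0
    return max(0.0, 1.0 - abs(x - y) / largest)


# Assumed weights: here size matters most and tags least, as in the question.
WEIGHTS = {"tags": 1.0, "categories": 2.0, "types": 2.0, "size": 3.0}


def similarity(p, q, weights=WEIGHTS):
    """Weighted average of the per-criterion scores, in the range [0, 1]."""
    scores = {
        "tags": jaccard(p.tags, q.tags),
        "categories": jaccard(p.categories, q.categories),
        "types": jaccard(p.types, q.types),
        "size": size_similarity(p.size, q.size),
    }
    total = sum(weights.values())
    return sum(weights[key] * scores[key] for key in scores) / total


def most_similar(target, projects, n=3):
    """Return the n projects most similar to target (brute force, O(len(projects)))."""
    others = [p for p in projects if p is not target]
    return sorted(others, key=lambda p: similarity(target, p), reverse=True)[:n]
```

Calling `most_similar(project, all_projects)` then gives a ranked list. Because each criterion is scored separately, a project sharing two tags outranks one sharing a single tag when everything else is equal, and changing a weight changes how much that criterion counts.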
Once you have a metric, you can speed up searching for similar items by using a vp-tree or another metric tree structure (or a KD tree, if your items can be represented as points in a coordinate space), provided your metric obeys the triangle inequality: d(a, b) ≤ d(a, c) + d(c, b).
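To make the role of the triangle inequality concrete, here is a minimal vp-tree sketch. It indexes items by Jaccard distance (1 minus the Jaccard similarity, which does satisfy the triangle inequality) over tag sets only. The class and function names are illustrative assumptions, and a maintained library would be the better choice for real use.

```python
import random


def jaccard_distance(a, b):
    """Jaccard distance 1 - |a & b| / |a | b|, a true metric on sets."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


class VPTree:
    """Minimal vantage-point tree supporting nearest-neighbour lookup."""

    def __init__(self, items, dist, rng=None):
        self.dist = dist
        rng = rng or random.Random(0)
        self.root = self._build(list(items), rng)

    def _build(self, items, rng):
        if not items:
            return None
        # Pick a random vantage point and split the rest by distance to it.
        vantage = items.pop(rng.randrange(len(items)))
        if not items:
            return (vantage, 0.0, None, None)
        dists = [self.dist(vantage, item) for item in items]
        radius = sorted(dists)[len(dists) // 2]  # median distance
        inside = [item for item, d in zip(items, dists) if d <= radius]
        outside = [item for item, d in zip(items, dists) if d > radius]
        return (vantage, radius,
                self._build(inside, rng), self._build(outside, rng))

    def nearest(self, query):
        """Return (item, distance) for the stored item closest to query."""
        best = [None, float("inf")]

        def search(node):
            if node is None:
                return
            vantage, radius, inside, outside = node
            d = self.dist(query, vantage)
            if d < best[1]:
                best[0], best[1] = vantage, d
            # The triangle inequality bounds how close any item in the other
            # half can be, so that half is skipped when it cannot beat best.
            if d <= radius:
                search(inside)
                if d + best[1] > radius:
                    search(outside)
            else:
                search(outside)
                if d - best[1] <= radius:
                    search(inside)

        search(self.root)
        return best[0], best[1]
```

Usage would look like `VPTree(tag_sets, jaccard_distance).nearest(frozenset({"web", "python"}))`, where `tag_sets` is a list of tag sets (hypothetical data). The pruning is only valid because the distance is a metric; with a similarity score that violates the triangle inequality, the tree could silently skip the true nearest item.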