I’m designing an algorithm to compare two objects. I’ve got a formula, but I don’t know if it’s as good as it could be.
Essentially, I’m comparing tropes between two games to say how similar they are:
$divisor = ((count($similar_concepts) - $iterator) + ($total - $iterator) + ($iterator));
echo "<BR> Value: ".($iterator / $divisor);
But that’s not readable, so here is the same formula in plain terms:
SimilarTropes/( (OriginalTropes - SimilarTropes) + (NewTropes - SimilarTropes) + (SimilarTropes) )
I’m just not fully satisfied with the results. Here’s an example:
Similarities: 47
NewTropes: 107
OriginalTropes: 156
Answer: 0.21759259259259
I don’t like these results because I feel those numbers should yield a higher percentage of similarity.
I’d love some input here, and if I’m in the wrong place, at least some guidance on where I should go instead.
Thanks a lot!
Translation to Mathematics
Let me attempt to translate what you have into a more mathematical formula. It should be easier from there.
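OriginalTropes is the number of tropes in one game; call that game’s set of tropes A. NewTropes is the number of tropes in the other game; call its set B. Then Similarities is simply the size of the intersection of A and B. Your formula is then:
|Intersect(A, B)| / ((|A| - |Intersect(A, B)|) + (|B| - |Intersect(A, B)|) + |Intersect(A, B)|)
Simplifying, we have:
|Intersect(A, B)| / (|A| + |B| - |Intersect(A, B)|)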
In other words, you’re saying that the similarity is the number of common items divided by the total number of items minus the number of items in common. That denominator is just |Union(A, B)|, so this is the Jaccard index of the two sets.
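As a quick check on the simplified form, here is a minimal Python sketch that plugs in the counts from the question. The function name `similarity_from_counts` and the choice to return 0 for two empty trope lists are my own assumptions, not something from the original code.

```python
def similarity_from_counts(original: int, new: int, similar: int) -> float:
    """Return similar / (original + new - similar).

    This is the simplified form of the formula in the question:
    common items divided by the number of distinct items overall.
    """
    union_size = original + new - similar
    if union_size == 0:
        # Assumption: two empty trope lists are scored as 0.
        return 0.0
    return similar / union_size


# Counts taken from the example in the question.
score = similarity_from_counts(original=156, new=107, similar=47)
print(round(score, 4))  # 0.2176
```

The result matches the 0.2176 reported in the question, so the three-term divisor and the simplified one agree.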
Now let’s take a couple of special cases. Take A = B. Then we have |Intersect(A, B)| = |A| = |B|. Your formula is then:
|A| / (|A| + |A| - |A|) = 1
Limitations
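Let’s say now that the sets A and B are equal in size, but they only have half of their items in common. In other words:
|A| = |B| = n and |Intersect(A, B)| = n / 2
Your similarity score is then:
(n / 2) / (n + n - n / 2) = (n / 2) / (3n / 2) = 1/3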
Ideally, this should be 1/2, not 1/3. You get something similar if you consider any sets where |A| = |B| = n and where |Intersect(A, B)| = n * p for 0 <= p <= 1. Your formula then gives p / (2 - p), which is below p for every 0 < p < 1.
In general, for sets of the above form, your similarity algorithm underestimates the similarity between the two sets. This looks something like the purple curve in the image below. The blue curve is what cosine similarity would give. So if 50% are common and they are equal size, the two sets have a similarity of 0.5. Likewise, if they have 90% in common, they have a similarity of 0.9.
Cosine Similarity
What you may wish for is something similar to the angle between the two sets. Consider the total set of elements, Union(A, B), and define N = |Union(A, B)|. Let a and b be N-dimensional representations of A and B, where each element has value 1 if present in the original set or 0 if not.
Then you use the cosine of the angle as:
Cos(theta) = Dot(a, b) / (||a|| * ||b||)
Note that the notation ||a|| refers to the Euclidean length, not the size of the set. This may have better properties than what you were using before.
Example
Here’s an example. Let’s say:
Then the full distinct set, Union(A, B), is given as:
This means that N = |Union(A, B)| = 5. The tricky part becomes how to index each of these appropriately. You can actually use a dictionary plus a counter to index the elements. I’ll leave this to you to try out. For now, we’ll use the ordering of Union(A, B). Then a and b are given as:
At this point it becomes standard mathematics:
Sample Implementation
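Here is a minimal sketch in Python of the cosine approach described above, using a dictionary plus a counter to index the elements of Union(A, B). It assumes each game’s tropes arrive as a collection of hashable names. The function name `cosine_similarity` and the placeholder trope names `t1` to `t6` are made up for illustration, and returning 0 for an empty set is my own choice.

```python
import math


def cosine_similarity(a_items, b_items) -> float:
    """Cosine of the angle between the 0/1 vectors of two sets."""
    set_a, set_b = set(a_items), set(b_items)

    # Dictionary plus counter: give every distinct element of
    # Union(A, B) its own position 0..N-1.
    index = {}
    for item in set_a | set_b:
        index[item] = len(index)

    # N-dimensional 0/1 vectors: 1 if the element is in the set.
    a = [0] * len(index)
    b = [0] * len(index)
    for item in set_a:
        a[index[item]] = 1
    for item in set_b:
        b[index[item]] = 1

    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))  # Euclidean length
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: an empty set has no direction, so score it 0.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical trope names: equal-size sets sharing half their items.
game_a = ["t1", "t2", "t3", "t4"]
game_b = ["t3", "t4", "t5", "t6"]
print(cosine_similarity(game_a, game_b))  # 0.5
```

Because the vectors only contain 0s and 1s, this reduces to |Intersect(A, B)| / sqrt(|A| * |B|), so you can also compute it straight from the counts. With the numbers from the question (47, 156 and 107) that comes out at about 0.36.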