I am designing a metric to measure when a search term is “ambiguous.” A score near one means that it is ambiguous (“Ajax” could be a programming language, a cleaning solution, a Greek hero, a European soccer club, etc.) and a score near zero means it is pretty clear what the user meant (“Lady Gaga” probably means only one thing). As part of this metric, I have a list of possible interpretations and the frequency of each interpretation from past data, and I need to turn this into a number between 0 and 1.
For example, let's say the term is “Cats”: out of a million trials, 850,000 times the user meant the furry thing that meows, 80,000 times they meant the musical by that name, and the rest are abbreviations for things that were each meant only a trivial number of times. I would say this should have a low ambiguity score because even though there were multiple possible meanings, one was by far the preferred meaning. In contrast, let's say the term is “Friends”: out of a million trials, 500,000 times the user meant the people they hang out with all the time, 450,000 times they meant the TV show by that name, and the rest were some other meaning. This should get a higher ambiguity score because the different meanings were much closer in frequency.
TLDR: If I sort the array in decreasing order, I need a way to map arrays that fall off quickly to numbers close to zero and arrays that fall off slowly to numbers close to one. If the array is [1,0,0,0…] it should get a perfect score of 0, and if it is [1/n,1/n,1/n…] it should get a perfect score of 1. Any suggestions?
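What you are looking for sounds very similar to the entropy measure in information theory. It measures how uncertain a random variable is, based on the probabilities of each outcome. It is given by:
$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_b p(x_i)$$
where $p(x_i)$ is the probability of the $i$th possibility and $b$ is the base of the logarithm. So in your case, $p(x_i)$ would be the probability that a certain search phrase corresponded to an actual meaning. In the Cats example, you would have (lumping the rare meanings into a single other category):
$$H(X) = -\left(0.85 \log_2 0.85 + 0.08 \log_2 0.08 + 0.07 \log_2 0.07\right) \approx 0.76$$
For the Friends case, you would have (assuming only one other category):
$$H(X) = -\left(0.50 \log_2 0.50 + 0.45 \log_2 0.45 + 0.05 \log_2 0.05\right) \approx 1.23$$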
The higher number here means more uncertainty.
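As a rough check of the two examples, here is a minimal Python sketch of the base-2 entropy. The function name `entropy_bits` is made up for illustration, and the probabilities assume that every rare meaning is lumped into a single "other" category (0.07 for Cats, 0.05 for Friends).

```python
import math


def entropy_bits(probs):
    """Shannon entropy, in bits (log base 2), of a discrete distribution."""
    # Outcomes with p == 0 are skipped: they contribute nothing to the sum,
    # and math.log2(0) would raise an error.
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Assumption: the remaining trials are treated as one "other" meaning.
cats = [0.85, 0.08, 0.07]
friends = [0.50, 0.45, 0.05]

print(f"Cats:    {entropy_bits(cats):.2f} bits")
print(f"Friends: {entropy_bits(friends):.2f} bits")
```

Splitting the "other" bucket into many small meanings would raise both values somewhat, so the exact numbers depend on how the rare interpretations are grouped.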
Note that I am using log base 2 in both cases, but if you use a logarithm with base equal to the number of possibilities n, the scale works out to 0 to 1. This is the same as dividing the base-2 value by log2(n).
Note also that the most ambiguous case is when all $n$ possibilities have the same probability:
$$H(X) = -\sum_{i=1}^{n} \frac{1}{n} \log_n \frac{1}{n} = 1$$
and the least ambiguous case is when there is only one possibility:
$$H(X) = -1 \cdot \log_n 1 = 0$$
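Since you want the most ambiguous terms to be near 1, the normalized H(X) already points the right way and can be used directly as your metric. Taking 1.0-H(X) would flip the scale and measure how unambiguous a term is instead.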
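Putting the pieces together, a normalized score could be sketched as below. This is only one possible reading of the idea: the name `ambiguity_score` is hypothetical, the input is assumed to be raw counts per interpretation, and n is assumed to be the number of interpretations with a nonzero count. With that normalization a uniform distribution scores 1 and a single meaning scores 0, which is the orientation asked for in the question.

```python
import math


def ambiguity_score(counts):
    """Normalized entropy in [0, 1]: 0 = one clear meaning, 1 = all equally likely."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("need at least one observed interpretation")
    # Assumption: interpretations that were never observed are ignored.
    probs = [c / total for c in counts if c > 0]
    n = len(probs)
    if n == 1:
        return 0.0  # only one meaning was ever intended
    h = -sum(p * math.log(p) for p in probs)
    # Dividing by log(n) is equivalent to taking the logarithm in base n.
    return h / math.log(n)


print(ambiguity_score([850_000, 80_000, 70_000]))   # Cats
print(ambiguity_score([500_000, 450_000, 50_000]))  # Friends
print(ambiguity_score([1, 0, 0, 0]))                # single meaning
print(ambiguity_score([1, 1, 1, 1]))                # uniform
```

One caveat with this design: because the score is divided by log(n), it depends on how many interpretations are listed, so terms with different numbers of candidate meanings are not compared on exactly the same footing.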