I’m trying to calculate a confidence score that a string appears within a subset of a much larger set.
Say I have 10 words in my original list and I match a new word against all 10 words. Each match returns a similarity score. I’ve set a threshold to ignore any similarity score that is below 70%. So at the end I’m left with my input word possibly matching 3 words within my list.
To me, this gives a 33.333% chance that my input word matches each of the 3 words with the higher similarity scores. I want to calculate how confident I am that the word is a match for these three. I’ve calculated my confidence score as follows, but this seems wrong and way too simple.
- Cat 1 – 70% similarity – 33.3% chance.
- Cat 2 – 75% similarity – 33.3% chance.
- Cat 3 – 80% similarity – 33.3% chance.
((0.70) * (0.333)) + ((0.75) * (0.333)) + ((0.80) * (0.333)) = 75% Confident.
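As a quick check of the arithmetic above, here is a minimal Python sketch (the variable names are illustrative only). With equal 1/3 weights, the weighted sum is simply the arithmetic mean of the three similarity scores, which is why it comes out at 75%.

```python
# Similarity scores for the three candidate matches (from the example above).
similarities = [0.70, 0.75, 0.80]

# Each candidate is given the same "chance" of 1/3.
weight = 1 / len(similarities)

# The calculation from the question: sum of similarity * chance.
weighted_sum = sum(s * weight for s in similarities)

# The plain arithmetic mean of the same scores.
mean = sum(similarities) / len(similarities)

print(f"weighted sum: {weighted_sum:.4f}")
print(f"mean:         {mean:.4f}")
```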
What is the best method of calculating confidence levels?
EDIT: Better Sample as requested
Original Word Set
- Hello
- Help
- Hell
- Problem
- World
- Ocean
- Animal
- Carrot
- Brown
- Black
Match New Word – Helicopter against original word set.
The match returns 3 words from the original set with a similarity score of over 70%. The words returned were:
1. Hello – Similarity 70%
2. Help – Similarity 75%
3. Hell – Similarity 80%
I want to calculate a score that shows how confident I am that Helicopter is a match to the words returned.
Answer (from http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/ff9fc38e-8ca3-4d9a-b505-dfbe37910b17):
Your probabilities are not right (or are not probabilities). You seem to have assumed that your word is a match for one of the top three similarity scores (if it is, your confidence level is de facto 100%…). Also, the probability and similarity scores are not independent, so your calculation is also flawed if you’re looking for anything that has a basis in probability/statistics.
What you have actually done is work out the mean “similarity” for the top three cases. If that’s acceptable as your (non-statistical) confidence level, then that’s fine. But you’re going to have to make a value call on this yourself – there’s no mathematical basis really to what you are trying to do. To help further, you’ll have to give us a lot more information on:
Edit following your edit:
Your three “similarity” scores are far from independent, because the three words themselves are very “similar”. And in any event, any algorithm that says “helicopter” is 80% similar to “hell” is not very good. I’d say the confidence level is pretty close to zero in this case!
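To make the point about non-independence concrete, here is a rough sketch that scores the sample words with Python's `difflib.SequenceMatcher`. This is an assumption for illustration only: the question does not say which similarity algorithm produced the 70–80% scores, and a different measure will give different numbers. The helper name `similarity` is hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Original word set from the question.
WORDS = ["Hello", "Help", "Hell", "Problem", "World",
         "Ocean", "Animal", "Carrot", "Brown", "Black"]

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (hypothetical helper).

    Uses difflib's ratio, which is an assumed stand-in for whatever
    algorithm the original matcher used.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Score the new word against every word in the original set.
scores = {w: similarity("Helicopter", w) for w in WORDS}

# Compare the three candidate words with each other.
candidates = ["Hello", "Help", "Hell"]
pairwise = {(a, b): similarity(a, b) for a, b in combinations(candidates, 2)}

for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:8s} {score:.2f}")
for (a, b), score in pairwise.items():
    print(f"{a} vs {b}: {score:.2f}")
```

If the candidates score highly against each other, their scores against the input are largely measuring the same shared prefix, so averaging them adds little independent evidence.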