Say I have the following example distributions (vectors) of numbers in C++:
vector 1 vector 2 vector 3
11 4 65
128 6 66
12 4 64
13 4 62
12 5 65
14 5 63
16 7 190
60 3 210
120 4 220
126 5 242
77 6 231
14 4 210
12 7 222
13 6 260
11 8 300
14 6 233
99 80
15 66
13
I need to find a threshold for each vector. I’ll eliminate the larger (“bad”) numbers in each if they are above that vector’s threshold. I want to re-use this method to find a threshold on other similar vectors in the future. The numbers aren’t necessarily mostly smaller “good” numbers.
The threshold would ideally be just a hair larger than most of the smaller “good” numbers. For example, the first vector’s ideal threshold value would be around 17 or 18, the second’s would be about 8, and the third’s would be around 68-70.
I realize this is probably simple math but since I’m horrible at math in general, I would really appreciate a code example on how to find this magical threshold, in either C++ or Objective-C specifically, which is why I’m posting this in SO and not on the Math site.
Some things I’ve tried
float threshold = mean_of_vector;
float threshold = mean_of_vector / 1.5f;
float threshold = ((max_of_vector - min_of_vector) / 2.0f) + mean_of_vector;
Each of these seems to have its own issues, e.g. some include too many of the “good” average numbers (so the threshold was too low), some not enough good numbers (threshold too high), or not enough of the “bad” numbers. Sometimes they’ll work with specific vectors of numbers, for example, if the standard deviation is high, but not with others where the standard deviation is low.
I’m thinking the method would involve standard deviation and/or some sort of Gaussian distribution, but I don’t know how to piece them together to get the desired result.
Edit: I am able to re-sort the vectors.
You could simply eliminate the values above the 90th or 95th percentile.
Technically, you calculate the p = 0.9 (or 0.95) percentile of the array’s distribution.
Just sort the array ascending:
Then calculate position of percentile p:
Now filter the array by keeping all values < or <= the threshold.
This keeps the smallest 90% of the values.
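As a rough sketch, the three steps above (sort, compute the position, filter) could look like this in C++. The function names `percentileThreshold` and `keepAtOrBelow` are made up for illustration, and truncating the position to get `posInt` is only one of the possible rounding choices:

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <stdexcept>
#include <vector>

// Hypothetical helper: returns the value found at the p-percentile position
// of `values` (0 <= p <= 1). The vector is taken by value, so the caller's
// data stays in its original order.
float percentileThreshold(std::vector<float> values, double p) {
    if (values.empty() || p < 0.0 || p > 1.0) {
        throw std::invalid_argument("need a non-empty vector and 0 <= p <= 1");
    }
    // Step 1: sort ascending.
    std::sort(values.begin(), values.end());
    // Step 2: position of the percentile, truncated toward zero.
    std::size_t posInt = static_cast<std::size_t>(p * values.size());
    if (posInt >= values.size()) {
        posInt = values.size() - 1;  // p == 1.0 would otherwise run off the end
    }
    return values[posInt];
}

// Hypothetical helper: step 3, keep every value <= threshold.
std::vector<float> keepAtOrBelow(const std::vector<float>& values, float threshold) {
    std::vector<float> kept;
    std::copy_if(values.begin(), values.end(), std::back_inserter(kept),
                 [threshold](float v) { return v <= threshold; });
    return kept;
}
```

Calling `percentileThreshold(v, 0.9)` and then `keepAtOrBelow(v, threshold)` keeps roughly the smallest 90% of `v`. Whether you compare with `<` or `<=` decides if the threshold value itself survives.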
For mathematically “perfect” results you could search for “calculate percentile of discrete array / values”.
As I remember, there are two valid algorithms, which differ in whether one has to round posInt down or up. In my example above I just truncated.
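To make the rounding choice concrete, here is one possible reading of the two variants, assuming the round-up one is the common "nearest rank" convention (one-based rank `ceil(p * n)`). The names `indexRoundDown` and `indexNearestRank` are hypothetical, and both assume `n > 0` and `0 <= p <= 1`:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Round-down variant: zero-based index floor(p * n), clamped to the last element.
std::size_t indexRoundDown(double p, std::size_t n) {
    std::size_t i = static_cast<std::size_t>(std::floor(p * n));
    return std::min(i, n - 1);
}

// Round-up variant (nearest rank): one-based rank ceil(p * n), at least 1,
// converted to a zero-based index.
std::size_t indexNearestRank(double p, std::size_t n) {
    std::size_t rank = static_cast<std::size_t>(std::ceil(p * n));
    rank = std::max<std::size_t>(rank, 1);
    return std::min(rank, n) - 1;
}
```

With these definitions the two only disagree when `p * n` is a whole number (for example `p = 0.5`, `n = 10` gives index 5 versus index 4), so for a rough outlier cut-off the difference is at most one element.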