Okay, I have a dilemma. So far my script converts page titles into categories. This is based on keywords: when there is a match, a certain score is added, i.e. some words hold a value of 10, some only 1. These scores are accumulated into a total score for each category.
[{15: [32, 'massages']}, {45: [12, 'hair-salon']}, {23: [3, 'automotive service']}]
The key is the category id, the first value is the score, and the second value is the category name.
In some instances this spans more than 10 category matches.
How can I filter this down to only the top 60-75%?
I.e. massages and hair salon are clearly the top matches, as they score well above automotive service. But how can this intuition be programmed?
I was thinking stddev could help?
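To make the stddev idea concrete, here is a rough sketch of what I have in mind. It assumes the list-of-single-key-dicts layout shown above; the function name filter_by_stddev and the k parameter are made up for illustration. It keeps every entry whose score is at least the mean plus k population standard deviations, so k is a tuning knob rather than anything principled.

```python
from statistics import mean, pstdev


def filter_by_stddev(data, k=0.0):
    """Keep entries whose score is >= mean + k * population stddev.

    `data` is assumed to be a list of single-key dicts shaped like
    {category_id: [score, name]}. `k` is a hypothetical tuning parameter.
    """
    if not data:
        return []
    # Pull the score out of each single-key dict.
    scores = [next(iter(entry.values()))[0] for entry in data]
    threshold = mean(scores) + k * pstdev(scores)
    # Preserve the original order of the entries that pass the threshold.
    return [entry for entry, score in zip(data, scores) if score >= threshold]
```

With k=0 the example above keeps only massages, while k=-0.5 keeps massages and hair-salon, so the result depends heavily on how k is chosen.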
Edit
I am trying to filter out low-scoring items, e.g.
data = [{15: [32, 'massages']}, {45: [1, 'hair-salon']}, {23: [1, 'automotive service']}]
Massages is the only high scoring item in this instance
data = [{15: [4, 'massages']}, {45: [2, 'hair-salon']}, {23: [1, 'automotive service']}]
Still massages
data = [{15: [10, 'massages']}, {45: [50, 'hair-salon']}, {23: [5, 'automotive service']}]
Now hair-salon (as it is well above others)
So I do not want the first N objects, but rather the first objects that are some factor x higher than the other numbers, as a percentage or some form of standard deviation.
So 50 is much higher than 10 and 5
10 is much higher than 3 or 2
However 9, 8 and 6 are much the same
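One way to express "much higher than the rest" in code is to sort the scores and cut at the largest relative drop between neighbours. This is only a sketch under assumptions: the name filter_by_largest_gap and the min_ratio cutoff of 1.5 are hypothetical, scores are assumed to be non-negative, and when no drop reaches min_ratio the scores are treated as "much the same" and everything is kept.

```python
def filter_by_largest_gap(data, min_ratio=1.5):
    """Keep the entries above the largest relative drop in score.

    `data` is assumed to be a list of single-key dicts shaped like
    {category_id: [score, name]}. `min_ratio` is a hypothetical cutoff:
    if no neighbouring pair differs by at least this factor, all entries
    are returned (sorted by score, highest first).
    """
    def score_of(entry):
        return next(iter(entry.values()))[0]

    # Rank entries from highest to lowest score.
    ranked = sorted(data, key=score_of, reverse=True)
    scores = [score_of(entry) for entry in ranked]

    best_ratio, cut = 0.0, len(ranked)
    for i in range(len(scores) - 1):
        upper, lower = scores[i], scores[i + 1]
        if lower > 0:
            ratio = upper / lower
        else:
            # Avoid dividing by zero: any positive score dominates a zero.
            ratio = float("inf") if upper > 0 else 1.0
        # Strict comparison means ties resolve to the earliest (highest) gap.
        if ratio > best_ratio:
            best_ratio, cut = ratio, i + 1

    if best_ratio < min_ratio:
        return ranked  # scores are too close together to separate
    return ranked[:cut]
```

On the examples above this keeps only massages for scores 32/1/1 and 4/2/1, only hair-salon for 10/50/5, and keeps all three when the scores are 9, 8 and 6.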
returns
(the two highest-scoring items)