I have an application that analyzes a system with a large number of interactions, and I need to make certain choices based on how often each unique item occurs in the system. For example, if you had this list of letters:
A, B, F, G, A, T, S, B, S, B, S, Q, Z, B, Q, S
Here is a list showing how often each letter occurs (occurrences):
A - 2
B - 4
F - 1
G - 1
Q - 2
T - 1
S - 4
Z - 1
So the frequencies of those occurrence counts are as follows (occurrence occurrences):
4 - 2
2 - 2
1 - 4
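To make the two levels of counting concrete, here is a minimal Python sketch using the letter list above. The variable names (`occurrences`, `occurrence_frequency`) are just my own labels for the two tables, not anything taken from the real application.

```python
from collections import Counter

# The example list of letters from above.
letters = "A, B, F, G, A, T, S, B, S, B, S, Q, Z, B, Q, S".split(", ")

# First level: how often each letter occurs.
occurrences = Counter(letters)

# Second level: how often each occurrence count itself occurs
# ("occurrence occurrences").
occurrence_frequency = Counter(occurrences.values())

# Print the pairs sorted in descending order on the occurrence count.
for count, how_many in sorted(occurrence_frequency.items(), reverse=True):
    print(f"{count} - {how_many}")
```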
The above is a tiny example, but I’ve attached an image, which is a simple line graph of a larger system.

In this graph the numbers along the bottom aren’t really important. They are just marking the number of unique frequencies. And the Y-axis marks the value of that frequency.
What I’m looking for is a mathematical/programmatic way to find the point where that line begins to break upwards. My searches haven’t yielded what I’m looking for as I’m not really sure what the proper terminology is, or the name of the concept.
Right now, we have to manually choose that point based on a human looking at the numbers and saying “here”. But I want to, at the very least, already have a “recommended” value chosen, and at the most, be able to remove the human component completely.
For clarification, my current algorithm produces a list of number pairs: occurrence and occurrence frequency. My use of the word “frequency” in no way relates to electromagnetic signals, but rather to how often an occurrence occurs. I thought that saying “occurrence occurrences” would be more confusing!
In this system, the general trend is that a few entities will show up in a large number of interactions, more entities will show up in a medium number of interactions, but the greatest number of entities will show up in just a few, or even no, interactions. It would be tough to imagine a scenario where it was different than that… worst case would probably be a plateau. But there could definitely be a dip after a jump at any point from the beginning to the end. The illustration above just doesn’t show that. We cannot assume that there will be a point where it will begin to rise with no drops afterwards.
Here is my data. (The simple graph above was produced with the Occurrence Frequency column data only):

This list, as you can see, is sorted in descending order on the occurrence column. This is from a small system with 904 unique entities. Those entities have 38 unique occurrence rates. If you started at the top of this list, you could say:
"2 entities occur 309 times"
"1 entity occurs 130 times"
etc.
Ultimately what I’m trying to determine is the importance of an entity based on how often it occurs in the system. I need to be able to flag certain items as “important”, but all items can’t be important. And the method/algorithm I’m looking for would help to identify at what point in that list do I stop considering items important.
If you look at the list, you can see where the lower occurrences start becoming more frequent. I don’t think that I can sort on the right column because the left column is really the key data. Greater occurrences = more importance.
But I still need to figure out how to determine that.
Is there any reason the larger example isn’t sorted? If you sort it by increasing Y values, then you can take the slope of each consecutive pair, and call the breakpoint where the slope changes significantly.
You can tweak the rules for “changes significantly” to meet your exact needs. It might be as simple as “the slope that increases most compared to the previous one”, or “the first slope that varies more than X% from the running average slope”. Or maybe the largest root sum of squares (RSS) of the differences between the slope at the test point and the slopes just before and just after it.
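As a rough sketch of the simplest of those rules (“the slope that increases most compared to the previous one”), something like the following Python could work. The function name `largest_slope_jump` and the sample values are made up for illustration, and it assumes the points are evenly spaced, so the slope is just the difference between neighbours.

```python
def largest_slope_jump(values):
    """Return the value at which the slope increases most sharply.

    Assumes unit spacing between points, so each slope is simply the
    difference between consecutive sorted values.
    """
    ys = sorted(values)
    if len(ys) < 3:
        raise ValueError("need at least three values to compare slopes")

    # Slope of each consecutive pair.
    slopes = [b - a for a, b in zip(ys, ys[1:])]

    # How much each slope grew compared to the one before it.
    jumps = [later - earlier for earlier, later in zip(slopes, slopes[1:])]

    # jumps[k] compares the segment ending at ys[k + 1] with the segment
    # starting there, so ys[k + 1] is the point where the line bends.
    k = max(range(len(jumps)), key=jumps.__getitem__)
    return ys[k + 1]


# Made-up values, only to show the call.
sample = [1, 1, 2, 2, 3, 4, 30, 35, 40]
breakpoint_value = largest_slope_jump(sample)
print(breakpoint_value)
```

Because this picks the single largest jump, it is sensitive to noise and to dips after a rise; the running-average or RSS variants would need their own thresholds.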
After the edit, I think it may be as simple as taking a percentage. Multiply the two columns of each row together (occurrences times the number of entities with that many occurrences), and take the sum over all rows. That’s the total number of events observed. Now start from the bottom of your table, and subtract each row’s product from the total until what remains is less than some chosen percentage of the original total. The rows you are left with are the “significant” ones that contributed most to the total.
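Here is a minimal Python sketch of that percentage idea, run on the small letter example from the question. The function name `significant_rows` and the 60% threshold are arbitrary choices for illustration; the right percentage is something you would have to tune.

```python
def significant_rows(rows, keep_fraction):
    """Trim low-occurrence rows until less than keep_fraction of all events remain.

    rows: (occurrences, number_of_entities) pairs.
    Returns the surviving rows, the events they account for, and the total.
    """
    rows = sorted(rows, reverse=True)  # descending on the occurrence column
    total = sum(occ * n for occ, n in rows)
    remaining = total

    # Start from the bottom of the table and subtract each row's product.
    while rows and remaining >= keep_fraction * total:
        occ, n = rows.pop()
        remaining -= occ * n

    return rows, remaining, total


# Occurrence table from the letter example: (occurrences, entity count).
table = [(4, 2), (2, 2), (1, 4)]
kept, remaining, total = significant_rows(table, keep_fraction=0.6)
print(kept, remaining, total)
```

With this table the total is 16 events; dropping the rows for 1 and 2 occurrences leaves 8, which is under 60% of 16, so only the letters that occur 4 times are kept.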
I have a feeling this is a common problem in statistics, but I don’t have enough background to say what the proper terminology is, although standard deviations come to mind.