I want grouped ranking on a very large table. I’ve found a couple of solutions for this problem, e.g. in this post and in other places on the web. I am, however, unable to work out the worst-case complexity of these solutions. The specific problem consists of a table where each row has a name and a number of points. I want to be able to request rank intervals such as 1-4. Here is some example data:
name | points
Ab | 14
Ac | 14
B | 16
C | 16
Da | 15
De | 13
With these values the following ‘ranking’ is created:
Query id | Rank | Name
1 | 1 | B
2 | 1 | C
3 | 3 | Da
4 | 4 | Ab
5 | 4 | Ac
6 | 6 | De
It should then be possible to request an interval of query ids, e.g. 2-5, which gives the ranks 1, 3, 4 and 4.
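To pin down the expected output, here is a minimal Python sketch of the ranking rule described above (ties share the rank of the first row in their group). The function names `ranked` and `rank_interval` are made up for illustration, and this version sorts the whole table, so it only defines the semantics and says nothing about the log(n) goal.

```python
from itertools import islice

rows = [("Ab", 14), ("Ac", 14), ("B", 16), ("C", 16), ("Da", 15), ("De", 13)]

def ranked(rows):
    """Yield (query_id, rank, name), ordered by points descending, then name ascending."""
    ordered = sorted(rows, key=lambda r: (-r[1], r[0]))
    rank = 0
    prev_points = None
    for query_id, (name, points) in enumerate(ordered, start=1):
        if points != prev_points:
            # A new points value starts a new group; its rank is its position.
            rank = query_id
            prev_points = points
        yield query_id, rank, name

def rank_interval(rows, from_id, till_id):
    """Return the rows whose query id lies in [from_id, till_id]."""
    return list(islice(ranked(rows), from_id - 1, till_id))

print(rank_interval(rows, 2, 5))
# [(2, 1, 'C'), (3, 3, 'Da'), (4, 4, 'Ab'), (5, 4, 'Ac')]
```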
The database holds about 3 million records, so if possible I want to avoid a solution with complexity greater than log(n). There are constant updates and inserts on the database, so these actions should preferably be performed in log(n) complexity as well. I am not sure it’s possible, though, and I’ve been trying to wrap my head around it for some time. I’ve come to the conclusion that a binary search should be possible, but I haven’t been able to create a query that does this. I am using a MySQL server.
I will elaborate on how the pseudo code for the filtering could work. Firstly, an index on (points, name) is needed. As input you give a fromrank and a tillrank. The total number of records in the database is n. The pseudocode should look something like this:
Find the median points value and count the rows with fewer points than this value (the count gives a rough estimate of the rank, not considering rows with the same number of points). If the number returned is greater than the fromrank delimiter, we subdivide the first half and find its median. We keep doing this until we have pinpointed the points value at which fromrank should start. Then we do the same within that points value using the name part of the index, taking medians until we have reached the correct row. We do exactly the same thing for tillrank.
This should take about log(n) subdivisions. So, provided the median and the count can be computed in log(n) time, it should be possible to solve the problem in worst-case complexity log(n). Correct me if I am wrong.
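The binary-search idea can be sketched in Python on an in-memory sorted list, which stands in for the (points, name) index. This is an assumption-heavy illustration: the names `rank_of_points` and `interval` are hypothetical, and a Python list gives direct access by position, which an ordinary database index may not. Whether MySQL can do the counting step in log(n) time is exactly the open question here.

```python
from bisect import bisect_left

rows = [("Ab", 14), ("Ac", 14), ("B", 16), ("C", 16), ("Da", 15), ("De", 13)]

# Keys sorted ascending by (-points, name): points descending, name ascending.
keys = sorted((-points, name) for name, points in rows)

def rank_of_points(keys, points):
    """Rank of a points value: 1 + number of rows with strictly more points.

    The one-element tuple (-points,) sorts before every (-points, name),
    so bisect_left lands on the first row of that points group.
    """
    return bisect_left(keys, (-points,)) + 1

def interval(keys, from_id, till_id):
    """Return (query_id, rank, name) for query ids in [from_id, till_id]."""
    out = []
    for query_id in range(from_id, min(till_id, len(keys)) + 1):
        neg_points, name = keys[query_id - 1]
        out.append((query_id, rank_of_points(keys, -neg_points), name))
    return out

print(interval(keys, 2, 5))
# [(2, 1, 'C'), (3, 3, 'Da'), (4, 4, 'Ab'), (5, 4, 'Ac')]
```

Each rank lookup is one binary search, but inserting into a Python list is linear; keeping updates logarithmic as well would need something like a balanced tree that stores subtree sizes.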
You need a stored procedure to be able to call this with parameters:
If you create the index and force MySQL to use it (as in my query), then the complexity of the query will not depend on the number of rows at all; it will depend only on tillrank.
It will actually take the last tillrank values from the index, perform some simple calculations on them and filter out the first fromrank values.
The time of this operation, as you can see, depends only on tillrank; it does not depend on how many records there are.
I just checked it on 400,000 rows: it selects ranks from 5 to 100 in 0.004 seconds (that is, instantly).
Important: this only works if you sort on names in DESCENDING order. MySQL (before version 8.0) does not support the DESC clause in indexes, which means that points and name must be sorted in the same direction for INDEX SORT to be usable (either both ASCENDING or both DESCENDING). If you want fast ASC sorting by name, you will need to keep negative points in the database and change the sign in the SELECT clause.
You may also remove name from the index altogether and perform the final ordering without using an index. That will impact performance on big ranges, but you will hardly notice it on small ranges.
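As a rough illustration of the approach described above (read only the top tillrank entries in index order, compute ranks on them, and drop the rows before fromrank), here is a small Python sketch using the standard sqlite3 module. It is not the original MySQL query: SQLite stands in for MySQL, and the table name `scores`, the column `neg_points`, the index name `ix_scores` and the function `top_ranks` are all assumptions made for the example. It also uses the negative-points trick so that a single ascending index yields points descending and name ascending.

```python
import sqlite3

data = [("Ab", 14), ("Ac", 14), ("B", 16), ("C", 16), ("Da", 15), ("De", 13)]

conn = sqlite3.connect(":memory:")
# Points are stored negated so that plain ascending order means "highest first".
conn.execute("CREATE TABLE scores (name TEXT, neg_points INTEGER)")
conn.execute("CREATE INDEX ix_scores ON scores (neg_points, name)")
conn.executemany("INSERT INTO scores VALUES (?, ?)", [(n, -p) for n, p in data])

def top_ranks(conn, fromrank, tillrank):
    """Return (query_id, rank, name) for positions fromrank..tillrank."""
    cur = conn.execute(
        "SELECT name, -neg_points FROM scores "
        "ORDER BY neg_points, name LIMIT ?",
        (tillrank,),
    )
    out = []
    rank, prev_points = 0, None
    for query_id, (name, points) in enumerate(cur, start=1):
        if points != prev_points:
            # First row of a new points group: its position is the group's rank.
            rank, prev_points = query_id, points
        if query_id >= fromrank:
            out.append((query_id, rank, name))
    return out

print(top_ranks(conn, 2, 5))
# [(2, 1, 'C'), (3, 3, 'Da'), (4, 4, 'Ab'), (5, 4, 'Ac')]
```

The work done here grows with tillrank rather than with the table size, provided the database actually serves the ORDER BY from the index; that part depends on the query planner and should be checked with EXPLAIN on the real server.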