I translated the problem into an employee/salary problem for simplicity.
Having an employee record emp such as:
| id | salary (in 1000s) |
Given a number ‘num‘, find salary ‘sal‘ where the number of employees receiving salary<=sal are >=num (similar to area under curve problems in statistics).
We are using Python and SQLite, but the problem is not specific to them:
I’m doing the following (naive starting solution):
num = some_num
sal = 1000 # starting miminmum value
count = 0
while count < num:
sql = 'select count(*) from (select 1 from emp where salary<=? limit ?)'
# using limit so that we don't keep counting more than num - might help (?)
(count,) = cursor.execute(sql, (sal, num)).next() # using apsw sqlite adapter
sal += 1000
print sal
How can we make this more efficient? (Algorithmically or using standard SQL or equivalent, but not using quirks of a given system)
Or else: can it be made more efficient by adding extra fields to the record that can be kept up-to-date on insert/update ops without too much overhead?
If you are using a prepared statement, I believe you can move the preparation step out of the loop to make it much faster.
If you further want to improve performance and your db size is reasonably small, you could load the entire data into an array and do your operation.
I think further optimization is possible if you sort the array by salary first. After that you can do things like binary Search to where the
<condition flips and the index of that point + 1 would be the count.The solution is simpler than it looks. If the records are sorted by salary then the
#num'threcord’s salary would be the desired answer, so this becomes a problem of selecting the n’th row: