I have a large table of N items with M (M>=3) distinct properties per item.
From this table I have to remove all items for which the same table contains an item that scores equal or better on all properties.
I have an algorithm (Python) that solves it already, but it is output-sensitive and has a worst case of approx. (n²+n)/2 comparisons, i.e. O(n²), when no items are removed in the process.
This is far too slow for my project (where datasets of 100,000 items with 8 properties per item are not uncommon), so I require something close to O(m*n log n) worst case, but I do not know whether this problem can be solved that fast.
Example problem case and its solution:
[higher value = better]
    Singing  Dancing  Acting
A        10       20      10
B        10       20      30
C        30       20      10
D        30       10      30
E        10       30      20
F        30       10      20
G        20       30      10
Dismiss all candidates for which there is a candidate that performs equal or
better in all disciplines.
Solution:
– A is dismissed because B, C, E and G perform equal or better in all disciplines.
– F is dismissed because D performs equal or better in all disciplines.
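For reference, the quadratic baseline described above can be sketched roughly as follows. This is only an illustration, not my actual code: the names `candidates`, `dominates` and `non_dominated` are made up for the example, and it assumes an item is only compared against the other items (so two identical rows would dismiss each other).

```python
# Example table from above: name -> (Singing, Dancing, Acting), higher is better.
candidates = {
    "A": (10, 20, 10),
    "B": (10, 20, 30),
    "C": (30, 20, 10),
    "D": (30, 10, 30),
    "E": (10, 30, 20),
    "F": (30, 10, 20),
    "G": (20, 30, 10),
}


def dominates(a, b):
    """True if row a scores equal or better than row b on every property."""
    return all(x >= y for x, y in zip(a, b))


def non_dominated(items):
    """Keep only the items that no other item dominates (O(n^2) comparisons)."""
    return {
        name: row
        for name, row in items.items()
        if not any(
            other != name and dominates(other_row, row)
            for other, other_row in items.items()
        )
    }


print(sorted(non_dominated(candidates)))  # ['B', 'C', 'D', 'E', 'G']
```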
Does there exist an algorithm that solves this problem efficiently, and what is it?
A general answer is to arrange the records into a tree and keep notes at each node of the maximum value in each column for the records lying beneath that node. Then, for each record, chase it down the tree from the top until you know whether it is dominated or not, using the notes at each node to skip entire subtrees where possible. (Unfortunately, you may have to search both descendants of a node.) When you remove a record as dominated you may be able to update the annotations in the nodes above it; since this need not involve rebalancing the tree, it should be cheap. You might hope to at least gain a speedup over the original code. If my memory of multi-dimensional search is correct, you could hope to go from N^2 to N^(2-f), where f becomes small as the number of dimensions increases.
One way to create such a tree is to repeatedly split the groups of records at the median of one dimension, cycling through the dimensions with each tree level. If you use a quicksort-like median search (quickselect) for each such split, you might expect the tree construction to take O(n log n) time. This structure is a kd-tree.
One tuning factor on this is not to split all the way down, but to stop splitting when the group size gets to some threshold (a leaf size, unrelated to the table size N) or less.
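The idea above might look something like the following sketch. It is untuned and makes several assumptions of its own: the names `Node`, `build`, `is_dominated`, `pareto_filter` and the default `leaf_size=8` are invented for the example; each split sorts the group instead of using a median search, so construction here is closer to O(n log² n); records are never removed from the tree, so the annotations are never updated; and a record is only compared against other records, so exact duplicates dismiss each other.

```python
class Node:
    """Tree node annotated with the per-column maximum of the records below it."""

    def __init__(self, maxes, left=None, right=None, items=None):
        self.maxes = maxes    # tuple: max value per column in this subtree
        self.left = left      # lower half on the split axis (None for a leaf)
        self.right = right    # upper half on the split axis (None for a leaf)
        self.items = items    # list of (index, row) pairs, leaves only


def build(indexed, depth=0, leaf_size=8):
    """Build the tree from a non-empty list of (index, row) pairs."""
    m = len(indexed[0][1])
    maxes = tuple(max(row[c] for _, row in indexed) for c in range(m))
    if len(indexed) <= leaf_size:
        return Node(maxes, items=indexed)  # small group: stop splitting
    axis = depth % m                       # cycle through the dimensions
    ordered = sorted(indexed, key=lambda pair: pair[1][axis])
    mid = len(ordered) // 2                # split at the median
    return Node(
        maxes,
        left=build(ordered[:mid], depth + 1, leaf_size),
        right=build(ordered[mid:], depth + 1, leaf_size),
    )


def is_dominated(node, idx, row):
    """True if some record other than idx scores >= row in every column."""
    # If any column maximum is below row's value, nothing here can dominate.
    if any(mx < v for mx, v in zip(node.maxes, row)):
        return False
    if node.items is not None:
        return any(
            j != idx and all(b >= a for a, b in zip(row, other))
            for j, other in node.items
        )
    # Both children may need searching; try the upper half first.
    return is_dominated(node.right, idx, row) or is_dominated(node.left, idx, row)


def pareto_filter(rows, leaf_size=8):
    """Return the rows that no other row dominates, in their original order."""
    if not rows:
        return []
    indexed = list(enumerate(rows))
    root = build(indexed, 0, max(1, leaf_size))
    return [row for i, row in indexed if not is_dominated(root, i, row)]


rows = [
    (10, 20, 10), (10, 20, 30), (30, 20, 10), (30, 10, 30),
    (10, 30, 20), (30, 10, 20), (20, 30, 10),
]
print(pareto_filter(rows, leaf_size=2))
# [(10, 20, 30), (30, 20, 10), (30, 10, 30), (10, 30, 20), (20, 30, 10)]
```

Querying against the full, static tree gives the same answer as removing records along the way, because dominance is transitive: anything dominated by a dismissed record is also dominated by whatever dismissed that record. Updating the annotations after each removal would only tighten the pruning.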