Given a large collection (let’s call it ‘a’) of elements of type T (say, a Vector or List) and an evaluation function ‘f’ (say, (T) => Double) I would like to derive from ‘a’ a result collection ‘b’ that contains the N elements of ‘a’ that result in the highest value under f. The collection ‘a’ may contain duplicates. It is not sorted.
Leaving the question of parallelizability (map/reduce etc.) aside for a moment, what would be the appropriate Scala data structure for compiling the result collection ‘b’? Thanks for any pointers / ideas.
Notes:
(1) I guess my use case can be most concisely expressed as
val N = 3 // number of results wanted (example value)
val a = Vector(9, 2, 6, 1, 7, 5, 2, 6, 9) // just an example
val f: Int => Double = n => n.toDouble // evaluation function
val b = a.sortBy(x => -f(x)).take(N) // sort by descending f, then clip
except that I do not want to sort the entire set.
(2) one option might be an iteration over ‘a’ that fills a TreeSet with ‘manual’ size bounding (reject anything worse than the worst item in the set, don’t let the set grow beyond N). However, I would like to retain duplicates present in the original set in the result set, and so this may not work.
(3) if a sorted multi-set is the right data structure, is there a Scala implementation of this? Or a binary-sorted Vector or Array, if the result set is reasonably small?
You can use a priority queue. We fill the queue with the first k elements and then compare each additional element to the head of the queue, swapping as necessary. This works as expected and retains duplicates, and it doesn’t sort the complete list. I’ve got an Ordering in this implementation, but adapting it to use an evaluation function would be pretty trivial.
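A minimal sketch of the bounded priority queue approach described above, assuming scala.collection.mutable.PriorityQueue from the standard library. The name topNBy and its parameters are hypothetical, and this variant takes the evaluation function directly rather than an Ordering. The queue is ordered so that its head is the worst element kept so far, which makes the comparison against each new element cheap.

```scala
import scala.collection.mutable

// Hypothetical helper: the n elements of xs with the highest f-value, best first.
// Duplicates are kept, and xs is traversed once without being fully sorted.
def topNBy[T](xs: Iterable[T], n: Int)(f: T => Double): Vector[T] =
  if (n <= 0) Vector.empty
  else {
    // PriorityQueue exposes the maximum under its ordering as head, so we
    // invert the comparison: the head is then the lowest-scoring kept element.
    val worstFirst = Ordering.fromLessThan[(Double, T)]((x, y) => x._1 > y._1)
    val queue = mutable.PriorityQueue.empty[(Double, T)](worstFirst)

    for (x <- xs) {
      val score = f(x)
      if (queue.size < n) {
        queue.enqueue((score, x))          // still filling the first n slots
      } else if (score > queue.head._1) {  // strictly better than the worst kept
        queue.dequeue()
        queue.enqueue((score, x))
      }
    }

    // Dequeuing yields worst to best, so reverse at the end.
    val out = Vector.newBuilder[T]
    while (queue.nonEmpty) out += queue.dequeue()._2
    out.result().reverse
  }

val a = Vector(9, 2, 6, 1, 7, 5, 2, 6, 9)
val b = topNBy(a, 3)(_.toDouble) // Vector(9, 9, 7)
```

Each element costs at most O(log n) queue work, so the whole pass is O(|a| log n) instead of O(|a| log |a|) for a full sort. Elements that tie with the current worst are not swapped in; which of several equal-scoring elements survives is therefore determined by encounter order.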