I have been thinking about splitting a recursion into more, smaller recursive calls, and wondered whether that has any practical use, particularly with parallelism in mind.
To make clear what I mean, a small example (merge sort):
Instead of doing:
...
merge_sort(b, m);
merge_sort(m, e);
merge(b, m, e);
...
doing something like this:
...
merge_sort_quad(b, m1);
merge_sort_quad(m1, m2);
merge_sort_quad(m2, m3);
merge_sort_quad(m3, e);
merge_quad(b, m1, m2, m3, e);
...
Considering a parallel example, I don’t see a fundamental difference between the two approaches, as they will probably end up doing the same thing:
void foo (..) {
...
// using tbb::parallel_invoke() to call functions in parallel;
// it takes callables, so each call is wrapped in a lambda
tbb::parallel_invoke([&] { foo(..); }, [&] { foo(..); });
...
}
void foo_parallel (..) {
...
tbb::parallel_invoke([&] { foo(..); }, [&] { foo(..); }, [&] { foo(..); }, [&] { foo(..); });
...
}
I hope you guys can explain to me whether this is totally useless and bad, or whether it is algorithm-dependent and might be of some practical use. I doubt it, as it looks a bit like manual loop unrolling.
You are correct, and indeed this is done with merge-sort. There are a few different ideas bundled together in your question, and some have further implications, so let’s separate them out. I’ll go over a few things I think you’re quite possibly clearer on than I am, because it’ll make for a more coherent answer for anyone else reading it.
First, recursion. There’s logical recursion, where we break a problem down into repeated versions of itself until they reach some point where they are trivial (classically, factorial by multiplying the current number by the factorial of one less until we reach 1), and there’s functional recursion, where we model this by having a function call itself.
Logical recursion is a problem-solving technique for people. Functional recursion is a programming technique that reflects it. However, functional recursion can cost more than the iterative equivalent. Hence we often either have our compilers turn them into the iterative equivalent, have tail-call optimisation which pretty much does that too (by removing most or all of the cost of recursive calls) or when that fails, convert to iterative versions ourselves.
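To make that distinction concrete, here is a small sketch of the factorial written three ways. The function names are mine, purely for illustration, and note that C++ compilers are allowed but not required to turn the tail-recursive form into a loop.

```cpp
#include <cassert>
#include <cstdint>

// Functional recursion: mirrors the definition n! = n * (n - 1)!.
// The multiplication happens after the call returns, so this is not a tail call.
std::uint64_t factorial_recursive(unsigned n) {
    assert(n <= 20 && "21! no longer fits in 64 bits");
    return n <= 1 ? 1 : n * factorial_recursive(n - 1);
}

// Tail-recursive form: the recursive call is the last thing the function does,
// which is the shape a compiler can rewrite as a loop if it chooses to.
std::uint64_t factorial_tail(unsigned n, std::uint64_t acc = 1) {
    assert(n <= 20 && "21! no longer fits in 64 bits");
    return n <= 1 ? acc : factorial_tail(n - 1, acc * n);
}

// Hand-written iterative equivalent: same result, no call per step.
std::uint64_t factorial_iterative(unsigned n) {
    assert(n <= 20 && "21! no longer fits in 64 bits");
    std::uint64_t result = 1;
    for (unsigned i = 2; i <= n; ++i) {
        result *= i;
    }
    return result;
}
```

All three compute the same value; they differ only in how much call overhead and stack they may need.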
Now, in the particular sort of recursion we have with a merge-sort, we increase the number of simpler tasks as we break the problem down. That is, rather than n! becoming the single task of n × (n - 1)!, merge-sort becomes two tasks of sorting the two halves of the sequence, followed by the task of merging the results.
You’ve made the correct jump to concluding that this can lead to a parallel approach. There are some further features to this that make it interesting. If we broke it down into 4 sub-sorts like you have done, and gave each one to a different core, then each core will be dealing with memory that will be close together and load into caches together (the way that data being close together can help us), but it’s relatively unlikely that one thread will write to data in the same cache-line that another thread is interested in and force it to suffer from the cache being invalidated (“false sharing”, the way that data being close together can hurt us).
Since the sort is likely to be bound only by CPU and memory, there probably won’t be much to gain beyond 1 thread per core, or at most 1 thread per virtual processor if hyperthreaded.
Therefore splitting into separate function calls benefits performance up to the number of virtual processors. The example in your question would be the idea on a four-processor machine. After that, it’s unlikely that one thread would be able to help much by work-stealing from another when it came to an end, so from that point on you’re probably better off taking an iterative approach (whether hand-coded as such or turned into such by the compiler). Taking a functionally recursive approach beyond the point where we have a function per processor begins to hurt us again. However, it’s always possible that we miscalculate how many cores we actually have to use (because other processes are using them too), so it can be worth going a bit further than function-per-core and allowing those that finish first to take the left-overs.
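A minimal sketch of that cutoff idea, under some assumptions of mine: it uses std::async from the standard library rather than TBB so that it stands alone, the names merge_sort_task and parallel_merge_sort are hypothetical, and std::stable_sort stands in for whatever sequential (ideally iterative) merge-sort you would use past the cutoff.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <thread>
#include <vector>

using Iter = std::vector<int>::iterator;

// Hypothetical helper: sorts [first, last), spawning tasks only while depth > 0.
void merge_sort_task(Iter first, Iter last, unsigned depth) {
    assert(first <= last);
    if (last - first < 2) {
        return;  // zero or one element is already sorted
    }
    if (depth == 0) {
        // Past the cutoff: no more tasks, just a sequential sort
        // (stand-in for an iterative merge-sort).
        std::stable_sort(first, last);
        return;
    }
    Iter mid = first + (last - first) / 2;
    // Left half runs on another thread, right half on this one.
    // The two ranges are disjoint, so the tasks never touch the same elements.
    auto left = std::async(std::launch::async, [first, mid, depth] {
        merge_sort_task(first, mid, depth - 1);
    });
    merge_sort_task(mid, last, depth - 1);
    left.get();  // wait for the left half before merging
    std::inplace_merge(first, mid, last);
}

// Hypothetical entry point: pick the smallest depth with 2^depth >= hardware threads.
void parallel_merge_sort(std::vector<int>& v) {
    unsigned threads = std::thread::hardware_concurrency();
    if (threads == 0) {
        threads = 1;  // hardware_concurrency() may return 0 when it cannot tell
    }
    unsigned depth = 0;
    while ((1u << depth) < threads) {
        ++depth;
    }
    merge_sort_task(v.begin(), v.end(), depth);
}
```

On a four-processor machine this gives a depth of 2, that is, four leaf tasks, which corresponds to the four-way split in the question. Adding one to the depth would be one way of going a bit further than function-per-core, although plain std::async does no work-stealing, so a task scheduler such as TBB’s is the better fit for that.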
There’s quite a bit of material on parallelizing merge-sorts in the literature, and some frameworks and libraries have merge-sort implementations that make use of it.