I am implementing a simple bubble sort in CUDA, and I have a question.
I use the following code to swap two consecutive elements in the array:
if (a[threadIdx.x] > a[threadIdx.x + 1])
    Swap(a[threadIdx.x], a[threadIdx.x + 1]);
Note that the number of threads in the block is half the size of the array. Is this a good implementation? Would threads in a single warp execute in parallel even if there is a branch? And would it therefore take N iterations to sort the array?
Also note that I know there are better sorting algorithms I could implement, and that I could use Thrust, CUDPP, or a sample sorting algorithm from the SDK, but in my case I just need a simple algorithm to implement.
I’m glad you realise that bubble sort on the GPU is likely to perform terribly badly! I’m struggling to figure out how to get sufficient parallelism without having to launch many kernels. Also, you may struggle to work out when you’re done.
Anyway, to answer your specific question: yes, it’s highly likely that you will have warp divergence in this case. However, given that the “else” branch is effectively empty, this will not slow you down. On average (until the list is sorted), roughly half the threads in a warp will take the “if” branch while the other threads wait, and once the “if” branch is complete, the threads of the warp go back to executing in step. This is far from your biggest problem 🙂
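For what it's worth, here is a minimal sketch of how the compare-and-swap could be arranged so that threads do not step on each other. It is not the code from the question: as posted, thread i touches a[i] and a[i + 1], so neighbouring threads read and write the same element at the same time. The sketch swaps in odd-even transposition sort (the parallel cousin of bubble sort), which uses half as many threads as elements and is known to finish in N phases. The names oddEvenSortKernel and sortOnDevice are made up for illustration, and I am assuming int elements and an array small enough for a single block (at most 2048 elements with 1024 threads per block).

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical kernel: odd-even transposition sort inside ONE block.
// Assumes blockDim.x >= n / 2, so each thread owns one pair per phase.
__global__ void oddEvenSortKernel(int *a, int n)
{
    for (int phase = 0; phase < n; ++phase) {
        // Even phases compare (0,1), (2,3), ...; odd phases compare (1,2), (3,4), ...
        int i = 2 * threadIdx.x + (phase & 1);

        // Divergent branch with an empty "else": threads that do not swap
        // simply wait for the others in their warp.
        if (i + 1 < n && a[i] > a[i + 1]) {
            int tmp  = a[i];
            a[i]     = a[i + 1];
            a[i + 1] = tmp;
        }

        // Every thread must reach the barrier, so it sits outside the branch.
        __syncthreads();
    }
}

// Hypothetical host wrapper: copy in, sort on the device, copy back.
// Returns false on any CUDA error or if the array does not fit one block.
bool sortOnDevice(int *h, int n)
{
    if (n <= 1) return true;       // nothing to sort
    if (n > 2048) return false;    // assumption: 1024 threads per block at most

    int *d = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(int);
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;

    bool ok = cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int threads = (n + 1) / 2; // about half the array size
        oddEvenSortKernel<<<1, threads>>>(d, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d);
    return ok;
}
```

Calling sortOnDevice(h, n) on a host array should leave it in ascending order. Running a fixed N phases also sidesteps the "when am I done?" problem, at the cost of doing phases that may no longer swap anything. Going beyond one block would need a different synchronisation scheme, for example one kernel launch per phase.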