The application is to intersect two sorted lists of integers (set intersection), say list1 and list2.
Each element of list1 is assigned to a GPU thread, which does a binary search to check whether that element appears in list2. It is easy to see that there will be a huge amount of thread divergence in this application. I wonder if there is any good approach to reduce the thread divergence. I am using CUDA to implement this application.
I know there is an approach called P-ary search, but my task is to reduce the thread divergence of binary search. I also know there is a library called Thrust, but it does not seem to make any attempt to reduce divergence.
If both lists are sorted, binary search is not the best algorithm you can use. Doing a binary search for every element gives O(n lg n), but a merge-like algorithm that only keeps the common elements is O(n).
This is a silly algorithm to use a GPU for. The only case where I can see it making sense is when you've just generated the data on the GPU. In that case, you want to break the problem up into a bunch of smaller intersections and assign a thread to each.
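As a baseline, the merge-like intersection can be sketched as follows. This is a minimal sequential version in plain C++ (which also compiles as CUDA host code); the function name `intersect_sorted` is made up for illustration, and the standard library's `std::set_intersection` does essentially the same thing.

```cpp
#include <cstddef>
#include <vector>

// Linear-time intersection of two sorted lists (hypothetical helper name).
// Walks both lists once, advancing whichever side holds the smaller value.
std::vector<int> intersect_sorted(const std::vector<int>& list1,
                                  const std::vector<int>& list2) {
    std::vector<int> out;
    std::size_t i = 0, j = 0;
    while (i < list1.size() && j < list2.size()) {
        if (list1[i] < list2[j]) {
            ++i;                       // list1[i] cannot be in list2
        } else if (list2[j] < list1[i]) {
            ++j;                       // list2[j] cannot be in list1
        } else {
            out.push_back(list1[i]);   // common element
            ++i;
            ++j;
        }
    }
    return out;
}
```

Each loop iteration advances at least one index, so the work is bounded by the combined length of the two lists.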
To do that, pick k equally-spaced elements of list1 and find them in list2 using binary search. Similarly, pick k equally-spaced elements of list2 and find them in list1. You now have 2k ranges in each list, where each range has at most N/k elements. Now intersect those ranges in parallel. (Set k to be half the number of threads you want.)
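The partitioning step might look roughly like the sketch below. It is written as sequential C++ so that the splitting logic is easy to check. The names `intersect_range`, `partitioned_intersection`, `cut_a` and `cut_b` are hypothetical, and the loop over chunks stands in for what would be one GPU thread per chunk. It assumes both inputs are sorted in ascending order.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Merge-style intersection of a[a_lo, a_hi) and b[b_lo, b_hi).
// In a GPU version this would be the work of a single thread.
std::vector<int> intersect_range(const std::vector<int>& a, std::size_t a_lo, std::size_t a_hi,
                                 const std::vector<int>& b, std::size_t b_lo, std::size_t b_hi) {
    std::vector<int> out;
    while (a_lo < a_hi && b_lo < b_hi) {
        if (a[a_lo] < b[b_lo]) ++a_lo;
        else if (b[b_lo] < a[a_lo]) ++b_lo;
        else { out.push_back(a[a_lo]); ++a_lo; ++b_lo; }
    }
    return out;
}

std::vector<int> partitioned_intersection(const std::vector<int>& a,
                                          const std::vector<int>& b,
                                          std::size_t k) {
    // Splitter values: k equally spaced elements from each list.
    std::vector<int> splitters;
    for (std::size_t i = 0; i < k; ++i) {
        if (!a.empty()) splitters.push_back(a[i * a.size() / k]);
        if (!b.empty()) splitters.push_back(b[i * b.size() / k]);
    }
    std::sort(splitters.begin(), splitters.end());

    // Locate every splitter in both lists with binary search. Consecutive
    // cut positions delimit chunks that cover the same value range in a and b.
    std::vector<std::size_t> cut_a{0}, cut_b{0};
    for (int s : splitters) {
        cut_a.push_back(static_cast<std::size_t>(std::lower_bound(a.begin(), a.end(), s) - a.begin()));
        cut_b.push_back(static_cast<std::size_t>(std::lower_bound(b.begin(), b.end(), s) - b.begin()));
    }
    cut_a.push_back(a.size());
    cut_b.push_back(b.size());

    // Intersect matching chunks independently; the chunks are ordered by value,
    // so concatenating the partial results keeps the output sorted.
    std::vector<int> result;
    for (std::size_t t = 0; t + 1 < cut_a.size(); ++t) {
        std::vector<int> part = intersect_range(a, cut_a[t], cut_a[t + 1],
                                                b, cut_b[t], cut_b[t + 1]);
        result.insert(result.end(), part.begin(), part.end());
    }
    return result;
}
```

Because both lists are cut at the same values, equal elements always land in the same chunk, so no match is lost at a boundary. Some chunks can be empty (for example, the one before the smallest splitter), which is harmless. A real CUDA version would also need somewhere for each thread to write its results, such as a per-chunk output buffer that is compacted afterwards; that part is not shown here.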