I need some advice on a project that I am going to undertake. I am planning to run simple kernels (yet to decide, but I am hinging on embarassingly parallel ones) on a Multi-GPU node using CUDA 4.0 by following the strategies listed below. The intention is to profile the node, by launching kernels in different strategies that CUDA provide on a multi-GPU environment.
- Single host thread – multiple devices (shared context)
- Single host thread – concurrent execution of kernels on a single device (shared context)
- Multiple host threads – (Equal) Multiple devices (independent contexts)
- Single host thread – Sequential kernel execution on one device
- Multiple host threads – concurrent execution of kernels on one device (independent contexts)
- Multiple host threads – sequential execution of kernels on one device (independent contexts)
Am I missing out any categories? What is your opinion about the test categories that I have chosen and any general advice w.r.t multi-GPU programming is welcome.
Thanks,
Sayan
EDIT:
I thought that the previous categorization involved some redundancy, so modified it.
Most workloads are light enough on CPU work that you can juggle multiple GPUs from a single thread, but that only became easily possible starting with CUDA 4.0. Before CUDA 4.0, you would call cuCtxPopCurrent()/cuCtxPushCurrent() to change the context that is current to a given thread. But starting with CUDA 4.0, you can just call cudaSetDevice() to set the current context to correspond to a given device.
Your option 1) is a misnomer, though, because there is no “shared context” – the GPU contexts are still separate and device memory and objects such as CUDA streams and CUDA events are affiliated with the GPU context in which they were created.