I want to partition a large data set and split the work across multiple GPUs. I want to make the data static so that I don’t have to load it onto the GPUs again for a second run. The problem is that pthread_create requires all input data to be assembled into a “struct”, and I am not sure whether assembling a bunch of static data into a struct will work. Thanks for any suggestions.
In “modern” CUDA multi-GPU programming, it is no longer necessary to use a separate host thread to hold a context on each device. Since CUDA 4.0, the API is thread safe, and one host thread can hold and work with multiple contexts simply by using cudaSetDevice.
A really, really basic example of how to distribute a large dataset over multiple GPUs in CUDA 4.x or CUDA 5 could be as simple as:
[disclaimer: written in browser, never compiled or tested, use at your own risk]
assuming that the GPUs are sequentially numbered from [0,ngpus-1] and the source data is held in the floating point array host_array of length N. You get back an array of device pointers in pvals and the length of each array in plens. Note that each pointer is only valid in the context in which you allocated it, so make sure you select the device before using the pointer with a kernel launch or API call.
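The description above can be sketched roughly as follows. This is a minimal, untested sketch and not necessarily the code the answer originally had in mind. The names host_array, N, ngpus, pvals and plens come from the text; the helpers chunk_len and distribute, the even-split policy, and the array size in main are assumptions made for illustration. It also assumes an nvcc recent enough to accept C++11.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: number of elements given to GPU i when N elements are
// split over ngpus devices. The first (N % ngpus) devices get one extra.
static size_t chunk_len(size_t N, int ngpus, int i)
{
    size_t base = N / ngpus;
    size_t rem  = N % ngpus;
    return base + (static_cast<size_t>(i) < rem ? 1 : 0);
}

// Hypothetical helper: copies consecutive chunks of host_array to GPUs
// 0..ngpus-1 from a single host thread. On success, pvals[i] is a device
// pointer valid on GPU i and plens[i] is its length in elements.
static cudaError_t distribute(const float *host_array, size_t N, int ngpus,
                              std::vector<float *> &pvals,
                              std::vector<size_t> &plens)
{
    pvals.assign(ngpus, nullptr);
    plens.assign(ngpus, 0);
    size_t offset = 0;
    for (int i = 0; i < ngpus; ++i) {
        plens[i] = chunk_len(N, ngpus, i);
        size_t bytes = plens[i] * sizeof(float);

        // Select the device first: the allocation belongs to its context.
        cudaError_t err = cudaSetDevice(i);
        if (err != cudaSuccess) return err;
        if (bytes == 0) continue;  // nothing to place on this device

        err = cudaMalloc((void **)&pvals[i], bytes);
        if (err != cudaSuccess) return err;
        err = cudaMemcpy(pvals[i], host_array + offset, bytes,
                         cudaMemcpyHostToDevice);
        if (err != cudaSuccess) return err;
        offset += plens[i];
    }
    return cudaSuccess;
}

int main()
{
    int ngpus = 0;
    if (cudaGetDeviceCount(&ngpus) != cudaSuccess || ngpus < 1) {
        fprintf(stderr, "no CUDA devices found\n");
        return EXIT_FAILURE;
    }

    const size_t N = 1 << 20;  // illustrative size, not from the question
    std::vector<float> host_array(N, 1.0f);

    std::vector<float *> pvals;
    std::vector<size_t> plens;
    cudaError_t err = distribute(host_array.data(), N, ngpus, pvals, plens);
    if (err != cudaSuccess)
        fprintf(stderr, "distribute failed: %s\n", cudaGetErrorString(err));

    // Re-select each device before touching its pointer (here, to free it).
    for (int i = 0; i < ngpus; ++i) {
        cudaSetDevice(i);
        cudaFree(pvals[i]);
    }
    return err == cudaSuccess ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

Because the device allocations live until cudaFree is called or the process exits, a second kernel launch within the same process can reuse pvals without copying the data again, provided cudaSetDevice(i) is called before each use of pvals[i].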