I am currently involved in developing a large scientific computing project, and I am exploring hardware acceleration with GPUs as an alternative to the MPI/cluster approach. Our workload is mainly memory-bound, and the data set is too large to fit in GPU memory. To this end, I have two questions:
1) The books I have read say that it is illegal to access memory on the host with a pointer on the device (for obvious reasons). Instead, one must copy the memory from the host’s memory to the device memory, then do the computations, and then copy back. My question is whether there is a work-around for this — is there any way to read a value in system RAM from the GPU?
2) More generally, what algorithms/solutions exist for optimizing the data transfer between the CPU and the GPU during memory-bound computations such as these?
Thanks for your help in this! I am enthusiastic about making the switch to CUDA, simply because the parallelization is much more intuitive!
1) Yes, you can do this with most GPGPU packages.
The one I'm most familiar with is the AMD Stream SDK, which lets you allocate a buffer in "system" memory and use that as a texture that is read or written by your kernel. CUDA and OpenCL have the same ability; the key is to set the correct flags on the buffer allocation.
BUT…
You might not want to do that, because every read or write then goes across the PCIe bus, which has much higher latency and much lower bandwidth than the GPU's on-board memory.
The implementation is also free to interpret your requests liberally. You can tell it to locate the buffer in system memory, but the software stack is free to do things like relocate it into GPU memory on the fly, as long as the computed results are the same.
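For the CUDA case, the flag-based allocation described in 1) is usually called mapped (or "zero-copy") host memory. The following is a minimal sketch, not a tuned implementation: the kernel `scale`, the array size, and the `CHECK` macro are illustrative names, and it assumes a device that supports mapping host memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative helper: report a failed CUDA runtime call and exit main.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,          \
                    cudaGetErrorString(err_));                          \
            return 1;                                                   \
        }                                                               \
    } while (0)

// Hypothetical kernel: reads and writes host RAM through mapped pointers.
__global__ void scale(const float* in, float* out, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * factor;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Ask the runtime to allow mapping of pinned host memory.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    // Allocate page-locked host buffers that are mapped into the
    // device address space.
    float *h_in = nullptr, *h_out = nullptr;
    CHECK(cudaHostAlloc((void**)&h_in, bytes, cudaHostAllocMapped));
    CHECK(cudaHostAlloc((void**)&h_out, bytes, cudaHostAllocMapped));
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    // Get device-side pointers that alias the host buffers.
    float *d_in = nullptr, *d_out = nullptr;
    CHECK(cudaHostGetDevicePointer((void**)&d_in, h_in, 0));
    CHECK(cudaHostGetDevicePointer((void**)&d_out, h_out, 0));

    // No cudaMemcpy: each access in the kernel travels over PCIe.
    scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 2.0f);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    // Verify on the host.
    int errors = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != 2.0f * h_in[i]) ++errors;
    printf("mismatches: %d\n", errors);

    CHECK(cudaFreeHost(h_in));
    CHECK(cudaFreeHost(h_out));
    return errors != 0;
}
```

This pattern tends to pay off only when each value is touched once and accesses are coalesced; data that is reused many times is usually better copied into device memory first.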
2) All of the major GPGPU software environments (CUDA, OpenCL, the Stream SDK) support DMA transfers, which is what you probably want.
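In CUDA, one common way to use DMA transfers for a data set that does not fit on the GPU is to keep the data in pinned host memory and stream it through the device in chunks, using `cudaMemcpyAsync` on two streams so that the copy of one chunk can overlap with the computation on another. The sketch below is only an illustration under assumptions: the kernel `process`, the chunk size, and the two-stream layout are placeholders, and the real overlap achieved depends on the hardware.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative helper: report a failed CUDA runtime call and exit main.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,          \
                    cudaGetErrorString(err_));                          \
            return 1;                                                   \
        }                                                               \
    } while (0)

// Placeholder kernel: adds 1 to every element of one chunk, in place.
__global__ void process(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int total = 1 << 22;     // elements in the full host data set
    const int chunk = 1 << 18;     // elements resident on the GPU per buffer
    const int nStreams = 2;        // double buffering

    // Pinned (page-locked) host memory is needed for truly asynchronous DMA.
    float* h_data = nullptr;
    CHECK(cudaMallocHost((void**)&h_data, total * sizeof(float)));
    for (int i = 0; i < total; ++i) h_data[i] = 1.0f;

    // One stream and one device buffer per in-flight chunk.
    cudaStream_t streams[nStreams];
    float* d_buf[nStreams];
    for (int s = 0; s < nStreams; ++s) {
        CHECK(cudaStreamCreate(&streams[s]));
        CHECK(cudaMalloc((void**)&d_buf[s], chunk * sizeof(float)));
    }

    // Work within a stream runs in order, so reusing d_buf[s] is safe;
    // work in different streams may overlap.
    int c = 0;
    for (int offset = 0; offset < total; offset += chunk, ++c) {
        int s = c % nStreams;
        int len = (total - offset < chunk) ? (total - offset) : chunk;
        size_t bytes = len * sizeof(float);

        CHECK(cudaMemcpyAsync(d_buf[s], h_data + offset, bytes,
                              cudaMemcpyHostToDevice, streams[s]));
        process<<<(len + 255) / 256, 256, 0, streams[s]>>>(d_buf[s], len);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h_data + offset, d_buf[s], bytes,
                              cudaMemcpyDeviceToHost, streams[s]));
    }
    CHECK(cudaDeviceSynchronize());

    // Verify on the host: every element should now be 2.0f.
    int errors = 0;
    for (int i = 0; i < total; ++i)
        if (h_data[i] != 2.0f) ++errors;
    printf("mismatches: %d\n", errors);

    for (int s = 0; s < nStreams; ++s) {
        CHECK(cudaFree(d_buf[s]));
        CHECK(cudaStreamDestroy(streams[s]));
    }
    CHECK(cudaFreeHost(h_data));
    return errors != 0;
}
```

If the kernel is itself limited by how fast data arrives, the PCIe transfer rate remains the ceiling; overlapping only hides the compute time behind the copies.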