I’m doing research on GPUs in cluster environments that use MPI for communication.
To compare speedups, I plan to create:
A GPU-only matrix multiplication, no problem there.
A CPU-only matrix multiplication, also no problem.
But I can’t find a good implementation of CUDA + MPI matrix multiplication.
Does anyone have a hint about where I can find one, or can you suggest an implementation?
The Matrix Template Library (MTL4) can be a great starting point. Right now MTL4 supports multi-core and DMM, and we are almost done with a full GPU implementation. Peter and I have been talking about distributed GPU algorithms, but since our focus is on PDE solvers for the moment, distributed GPU algorithms are difficult to make competitive against robust DMM.
However, I am working on a new geophysics/medical imaging solver set that is better suited to distributed GPU computing, since the data sets are more modest and the video capabilities of the GPU are beneficial.
To get started, take a look at the MTL4 tutorial.
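Independent of MTL4, one common way to structure a CUDA + MPI matrix multiplication is a row-block decomposition: each rank owns a contiguous block of rows of A, every rank holds all of B, and each rank computes its own rows of C. What follows is only a minimal sketch of that partitioning in plain C++, under loud assumptions: the MPI ranks are simulated by a loop, the GPU kernel is replaced by a CPU triple loop, and every name (`RowRange`, `rows_for_rank`, `local_multiply`, `distributed_multiply`) is hypothetical and not taken from MTL4, CUDA, or MPI.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major dense matrix stored in a flat vector.
using Matrix = std::vector<double>;

// Half-open row range [begin, end) of C owned by one rank (hypothetical helper).
struct RowRange { std::size_t begin, end; };

// Split n rows as evenly as possible across `ranks` processes.
RowRange rows_for_rank(std::size_t n, std::size_t ranks, std::size_t rank) {
    assert(ranks > 0 && rank < ranks);
    std::size_t base = n / ranks, extra = n % ranks;
    std::size_t begin = rank * base + std::min(rank, extra);
    return {begin, begin + base + (rank < extra ? 1 : 0)};
}

// Local product for one rank: C[rows, :] = A[rows, :] * B, with A n x k and B k x m.
// In a CUDA + MPI program, this is the step that would run on the GPU.
void local_multiply(const Matrix& A, const Matrix& B, Matrix& C,
                    std::size_t k, std::size_t m, RowRange rows) {
    for (std::size_t i = rows.begin; i < rows.end; ++i)
        for (std::size_t j = 0; j < m; ++j) {
            double sum = 0.0;
            for (std::size_t p = 0; p < k; ++p)
                sum += A[i * k + p] * B[p * m + j];
            C[i * m + j] = sum;
        }
}

// Stand-in for the cluster: loop over ranks instead of launching MPI processes.
Matrix distributed_multiply(const Matrix& A, const Matrix& B,
                            std::size_t n, std::size_t k, std::size_t m,
                            std::size_t ranks) {
    Matrix C(n * m, 0.0);
    for (std::size_t r = 0; r < ranks; ++r)
        local_multiply(A, B, C, k, m, rows_for_rank(n, ranks, r));
    return C;
}
```

In a real cluster version, the loop over ranks would disappear: the row blocks of A would typically be distributed with something like `MPI_Scatterv`, B sent to all ranks with `MPI_Bcast`, and the C blocks collected with `MPI_Gatherv`, while `local_multiply` would be replaced by a CUDA kernel or a cuBLAS GEMM call. Whether this beats a CPU-only distributed version depends heavily on how the host-device and network transfer costs compare with the compute time.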