I have four CUDA kernels working on matrices in the following way:
convolution<<<>>>(A,B);
multiplybyElement1<<<>>>(B);
multiplybyElement2<<<>>>(A);
multiplybyElement3<<<>>>(C);
// A + B + C with CUBLAS' cublasSaxpy
Every kernel except the first (the convolution) basically multiplies each element of a matrix by a fixed value hardcoded in constant memory (to speed things up).
Should I join these kernels into a single one by calling something like
multiplyBbyX_AbyY_CbyZ<<<>>>(B,A,C)
?
The matrices should already be in global memory on the device, so merging probably would not help, but I’m not totally sure.
If I understood correctly, you’re asking whether you should merge the three “multiplybyElement” kernels into one, where each of those kernels reads an entire (different) matrix, multiplies each element by a constant, and stores the new scaled matrix.
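Given that these kernels will be memory-bandwidth bound (practically no computation, just one multiply per element), merging them is unlikely to bring any benefit: each element is still read once and written once, so the total memory traffic does not change. The exception is small matrices. In that case each separate kernel launches too few threads to fill the GPU, and since the kernels execute in series (same stream), the device is used inefficiently; a single merged kernel would help there.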
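For the small-matrix case, a merged kernel could look roughly like the sketch below. It is only an illustration under assumptions not stated in the question: all three matrices hold the same number of float elements n, the scale factors x, y and z are passed as kernel arguments rather than read from a __constant__ variable, and the host helper scaleOnDevice is a hypothetical name added just to make the example complete.

```cuda
#include <cuda_runtime.h>

// Merged kernel (sketch): one launch scales B by x, A by y and C by z.
// Assumption: all three arrays have exactly n elements.
__global__ void multiplyBbyX_AbyY_CbyZ(float *B, float *A, float *C, int n,
                                       float x, float y, float z)
{
    // Grid-stride loop, so any grid size covers all n elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        B[i] *= x;
        A[i] *= y;
        C[i] *= z;
    }
}

// Hypothetical host helper: copies the three arrays to the device, runs the
// merged kernel once, and copies the results back. Assumes n > 0.
// Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t scaleOnDevice(float *hB, float *hA, float *hC, int n,
                          float x, float y, float z)
{
    size_t bytes = (size_t)n * sizeof(float);
    float *dB = nullptr, *dA = nullptr, *dC = nullptr;

    cudaError_t err = cudaMalloc(&dB, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dA, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dC, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dC, hC, bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;  // enough blocks for n elements
        multiplyBbyX_AbyY_CbyZ<<<blocks, threads>>>(dB, dA, dC, n, x, y, z);
        err = cudaGetLastError();                  // catches launch errors
    }

    // The blocking copies below also wait for the kernel to finish.
    if (err == cudaSuccess) err = cudaMemcpy(hB, dB, bytes, cudaMemcpyDeviceToHost);
    if (err == cudaSuccess) err = cudaMemcpy(hA, dA, bytes, cudaMemcpyDeviceToHost);
    if (err == cudaSuccess) err = cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    // cudaFree on a null pointer is a no-op, so this is safe after a failure.
    cudaFree(dB);
    cudaFree(dA);
    cudaFree(dC);
    return err;
}
```

In your real code the matrices already live on the device, so only the kernel launch matters; the copies are here just so the sketch can be checked from the host. The merged version still moves exactly the same number of bytes as the three separate kernels, so any gain comes from having one launch instead of three, and it is worth timing both variants on your actual matrix sizes.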