I have four CUDA kernels working on matrices in the following way: convolution<<<>>>(A,B); multiplybyElement1<<<>>>(B);

Question

Asked: June 2, 2026 (2026-06-02T05:10:02+00:00)

I have four CUDA kernels working on matrices in the following way:

convolution<<<>>>(A,B);
multiplybyElement1<<<>>>(B);
multiplybyElement2<<<>>>(A);
multiplybyElement3<<<>>>(C);

// A + B + C with CUBLAS' cublasSaxpy

Every kernel except the first (the convolution) performs an element-wise multiplication of a matrix by a fixed value, which is hardcoded in constant memory to speed things up.

Should I join these kernels into a single one by calling something like

multiplyBbyX_AbyY_CbyZ<<<>>>(B,A,C)

?

The matrices should already be in global memory on the device, so merging probably would not help, but I’m not totally sure.

1 Answer

Editorial Team · Answer 1 · 2026-06-02T05:10:04+00:00

If I understood correctly, you’re asking whether you should merge the three “multiplybyElement” kernels into one, where each of those kernels reads an entire (different) matrix, multiplies each element by a constant, and stores the scaled matrix.
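For concreteness, one of these scaling kernels might look like the minimal sketch below. The question does not show the real kernel bodies, so the names `kScale`, `scaleInPlace`, and `scaleOnDevice` are hypothetical, and the matrix is assumed to be a flat `float` array that is scaled in place. The point is the per-element cost: one global load, one multiply, and one global store.

```cuda
#include <cuda_runtime.h>

// Hypothetical scale factor kept in constant memory, as the question describes.
__constant__ float kScale;

// Multiplies every element of m (n floats, flat layout assumed) by kScale in place.
__global__ void scaleInPlace(float *m, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        m[i] *= kScale;  // one load, one multiply, one store per element
    }
}

// Hypothetical host helper: sets the constant, then launches on a device pointer.
cudaError_t scaleOnDevice(float *d_m, int n, float factor)
{
    if (n <= 0) return cudaSuccess;  // nothing to do; avoids a zero-block launch

    cudaError_t err = cudaMemcpyToSymbol(kScale, &factor, sizeof(float));
    if (err != cudaSuccess) return err;

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scaleInPlace<<<blocks, threads>>>(d_m, n);
    return cudaGetLastError();  // reports launch-configuration errors
}
```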

These kernels will be memory-bandwidth bound: there is practically no computation, just one multiply for every element loaded and stored. Merging them is therefore unlikely to bring any benefit unless your matrices are small. With small matrices, each kernel on its own makes inefficient use of the GPU, and because the kernels are launched into the same stream they execute in series, so a single merged kernel could use the device better.
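If the matrices do turn out to be small, a merged kernel along the lines the question proposes could be sketched as follows. This is an illustration under assumptions, not the asker's actual code: the names `kFactors`, `scaleThreeFused`, and `launchScaleThreeFused` are hypothetical, all three matrices are assumed to be flat `float` arrays of the same length, and each is scaled in place.

```cuda
#include <cuda_runtime.h>

// Hypothetical factors for B, A and C (in that order), kept in constant memory.
__constant__ float kFactors[3];

// Scales B, A and C in place in a single launch; each array holds n floats.
__global__ void scaleThreeFused(float *B, float *A, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        B[i] *= kFactors[0];
        A[i] *= kFactors[1];
        C[i] *= kFactors[2];
    }
}

// Hypothetical host helper: uploads the three factors, then launches once.
cudaError_t launchScaleThreeFused(float *d_B, float *d_A, float *d_C, int n,
                                  float x, float y, float z)
{
    if (n <= 0) return cudaSuccess;  // nothing to do; avoids a zero-block launch

    const float factors[3] = {x, y, z};
    cudaError_t err = cudaMemcpyToSymbol(kFactors, factors, sizeof(factors));
    if (err != cudaSuccess) return err;

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scaleThreeFused<<<blocks, threads>>>(d_B, d_A, d_C, n);
    return cudaGetLastError();  // reports launch-configuration errors
}
```

This saves two kernel launches but moves the same number of bytes as the three separate kernels, which is why it should only matter when launch overhead and low occupancy dominate. Another option that keeps the original kernels is to launch them in separate CUDA streams; whether they then actually overlap depends on the device and on the resources each kernel uses.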
