I do some calculations in my code on a float* memory block. Since I

Question

Asked: June 4, 2026

I do some calculations in my code on a float* memory block. Since I am working on an image, I will have to do this with width * height points, and 180 rotations. I am starting 180 threads (1 per degree rotation), since this is the only parallelizable procedure in the code.
I rotate the images, and get for every rotation, for every point a resulting float value.
I have another float* block which stores the current maximum value at each point.

if(resultMap[i] < convrst)
{
    resultMap[i] = convrst;
    rMap[i] = (unsigned char)r0;
    oMap[i] = (unsigned char)index;
}

with resultMap storing the current highest value. convrst is the result of the convolution, and if the current result is higher than the ones before, it will save the value, plus the radius (r0) and rotation (index) at that point. r0 is originally an int, as well as index.
i is a counter going from 0 to imgsize-1.

Without the assignments in the { } part, the whole code will finish within like 2s, with the assignments it takes like 50s (and this is not taking into account that in that code I left out the locks to avoid synchronization problems).

Why is that code so slow, and what can I do to make it finish faster?

1 Answer

Editorial Team · Answer 1 · 2026-06-04T02:43:29+00:00

The reason for the large slowdown when you include writes in your kernel is the same as in this question (although that one is about OpenCL, the principle is the same). The NVIDIA compiler is extremely aggressive at optimising away “dead” code, that is, code which does not contribute to a shared memory or global memory write. So I would guess that when you don’t include the writes to global memory shown in your question, the compiler simply optimises large parts of the kernel away, greatly reducing the execution time.
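To make the dead-code point concrete, here is a minimal host-side C++ sketch rather than the asker's actual kernel. The names `expensive`, `without_store` and `with_store` are hypothetical, and `expensive` only stands in for the convolution, which the question does not show. Whether a given compiler really removes the unused work depends on the compiler and its optimisation flags.

```cpp
#include <cstddef>

// Hypothetical stand-in for the expensive per-point convolution.
static float expensive(std::size_t i) {
    float acc = 0.0f;
    for (int k = 1; k <= 1000; ++k) {
        acc += static_cast<float>(i % 7) / static_cast<float>(k);
    }
    return acc;
}

// The result is computed and then discarded. Nothing observable depends
// on it, so an optimising compiler is allowed to drop the work entirely.
void without_store(std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        float convrst = expensive(i);
        (void)convrst;  // never written to memory
    }
}

// The store to `out` makes the result observable, so the work must be done.
void with_store(float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = expensive(i);
    }
}
```

Timing `without_store` against `with_store` would therefore compare "almost nothing" with "the real work", which is the same effect as removing the assignments from the `{ }` block in the question.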

So, as in the other question I linked to, the real question should be why your kernel is taking 50 seconds to finish. Answering that will require more information about the code and the execution parameters you are calling it with, but if, as you wrote, you are running only 180 threads, that is the likely culprit. The GPU needs a lot more parallelism than that to achieve anything like peak performance.
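One possible way to expose more parallelism is to make each pixel the unit of work instead of each rotation. The sketch below is plain C++ and rests on an assumption the question does not confirm: that the convolution result for pixel `i` at rotation `index` can be computed independently of other pixels. `convolve_at`, `best_for_pixel` and `best_for_image` are hypothetical names, and the radius (`rMap`) is left out for brevity; it would be tracked the same way as the rotation.

```cpp
#include <cstddef>

// Hypothetical stand-in for the convolution result at pixel i for a given
// rotation; the real computation is not shown in the question.
float convolve_at(std::size_t i, int index) {
    return static_cast<float>((i * 31 + static_cast<std::size_t>(index) * 17) % 180);
}

// Work for ONE pixel: loop over all rotations, keep the running maximum in
// local variables, and write to memory once at the end.
void best_for_pixel(std::size_t i, int rotations,
                    float* resultMap, unsigned char* oMap) {
    float best = convolve_at(i, 0);
    int bestIndex = 0;
    for (int index = 1; index < rotations; ++index) {
        float convrst = convolve_at(i, index);
        if (best < convrst) {
            best = convrst;
            bestIndex = index;
        }
    }
    resultMap[i] = best;
    oMap[i] = static_cast<unsigned char>(bestIndex);
}

// On a GPU, best_for_pixel would be the body of one thread, giving
// width * height threads; here a plain loop stands in for the launch.
void best_for_image(std::size_t imgsize, int rotations,
                    float* resultMap, unsigned char* oMap) {
    for (std::size_t i = 0; i < imgsize; ++i) {
        best_for_pixel(i, rotations, resultMap, oMap);
    }
}
```

Because each pixel is written by exactly one worker in this layout, no two workers update the same `resultMap[i]`, so the locks mentioned in the question would not be needed for this step.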
