I wrote a piece of code for computing Self Quotient Image (SQI) in MATLAB.

Asked: June 11, 2026

I wrote a piece of code for computing the Self Quotient Image (SQI) in MATLAB, and now I want to rewrite part of it to run in parallel for a speedup.
This is the part in question:

siz=15;
X=normalize8(X);
[a,b]=size(X);
filt = fspecial('gaussian',[siz siz],sigma);
padsize = floor(siz/2);
padX = padarray(X,[padsize, padsize],'symmetric','both');

t0 = tic; % -------------------------------------------------------------
Z=zeros(a,b);
for i=padsize+1:a+padsize
    for j=padsize+1:b+padsize
        region = padX(i-padsize:i+padsize, j-padsize:j+padsize);
        means= mean(region(:));
        M=return_step(region, means);
        filt1=filt.*M;

        summ=sum(sum(filt1));        

        filt1=(filt1/summ);
        Z(i-padsize,j-padsize)=(sum(sum(filt1.*region))/(siz*siz));
    end
end
toc(t0) % -------------------------------------------------------------

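And the return_step function:

function M=return_step(X, means)
% Binary mask: 1 where X >= means, 0 elsewhere.

[a,b]=size(X);
M=zeros(a,b); % preallocate so M always has the same size as X
for i=1:a
    for j=1:b
        if X(i,j)>=means
            M(i,j)=1;
        end
    end
end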

I wrote the kernel function below:

__global__ void returnstep(const double* x, double* m, double* filt, int leng, double mean, int i, int j, int width)
{
    int idx=threadIdx.y*blockDim.x+threadIdx.x;
    if(idx>=leng) return;

    int ridx= (j+threadIdx.y)*width+threadIdx.x+i;
    double xval= x[ridx];
    if (xval>=mean) m[idx]=filt[idx]*xval;
    else            m[idx]=0;
}

and then changed the MATLAB code as follows:

kernel= parallel.gpu.CUDAKernel('returnstep.ptx', 'returnstep.cu');
kernel.ThreadBlockSize= [double(siz) double(siz) 1];
GM = gpuArray(zeros(siz,siz));
GpadX = gpuArray(padX);
Gfilt = gpuArray(filt);

%% Process image
t0 = tic; % -------------------------------------------------------------
Z=zeros(a,b);
for i=padsize+1:a+padsize
    for j=padsize+1:b+padsize
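        region = padX(i-padsize:i+padsize, j-padsize:j+padsize); % same window as in the sequential version, needed for means
        means= mean(region(:));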
        GM= feval(kernel, GpadX, GM, Gfilt, siz*siz, means, i-padsize-1, j-padsize-1, padXwidth);
        filt1=  gather(GM);

        summ=sum(sum(filt1));        

        filt1=(filt1/summ);
        Z(i-padsize,j-padsize)=(sum(sum(filt1))/(siz*siz));
    end
end
toc(t0) % -------------------------------------------------------------

My sequential code runs in 2.5 s for a 330x200 image, but the new parallel code takes 15 s, and I don't know why.
I need some advice on improving it. I am new to CUDA programming.

1 Answer

Editorial Team · Answer 1 · 2026-06-11T01:22:50+00:00

> help gather
...
X = GATHER(A) when A is a GPUArray, X is an array in the local workspace
with the data transferred from the GPU device.
....

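filt1 = gather(GM) copies GM from the GPU to the CPU on every iteration, which is very inefficient. For a 330x200 image that is 66,000 kernel launches, each followed by a device-to-host transfer, and each launch only processes a single 15x15 window. You should move the entire computation inside the loop nest to the GPU, or preferably move the entire loop nest into the GPU kernel. Otherwise you can forget about any speedup.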
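
To illustrate the second option, here is a minimal sketch of a kernel that takes over the whole loop nest, with one thread per output pixel. It is not the asker's kernel: the names sqi_filter, sqi_pixel and run_check, the 16x16 block size, the synthetic test data and the guard against a zero weight sum are all assumptions made for this example. It follows the arithmetic of the sequential MATLAB loop and assumes column-major arrays, as MATLAB stores them.

```cuda
// sqi_filter.cu (hypothetical file name): the whole loop nest runs on the GPU.
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Value of one output pixel. Arrays are column-major, as MATLAB stores them.
// padx has prows rows; (r, c) is the 0-based top-left corner of the window.
__host__ __device__ inline double sqi_pixel(const double* padx, const double* filt,
                                            int prows, int r, int c, int siz)
{
    // Pass 1: mean of the siz x siz window.
    double sum = 0.0;
    for (int j = 0; j < siz; ++j)
        for (int i = 0; i < siz; ++i)
            sum += padx[(c + j) * prows + (r + i)];
    const double mean = sum / (siz * siz);

    // Pass 2: keep only the filter weights where the pixel is >= mean.
    double wsum = 0.0, acc = 0.0;
    for (int j = 0; j < siz; ++j)
        for (int i = 0; i < siz; ++i) {
            const double x = padx[(c + j) * prows + (r + i)];
            if (x >= mean) {
                const double w = filt[j * siz + i];
                wsum += w;      // corresponds to summ in the MATLAB loop
                acc += w * x;   // corresponds to sum(sum(filt1.*region)) before normalizing
            }
        }
    // Guard added for this sketch: wsum can only be 0 through rounding in the mean.
    return wsum > 0.0 ? acc / (wsum * siz * siz) : 0.0;
}

// One thread per output pixel; x indexes rows so neighbouring threads read
// neighbouring addresses in column-major storage.
__global__ void sqi_filter(double* z, const double* padx, const double* filt,
                           int rows, int cols, int siz)
{
    const int r = blockIdx.x * blockDim.x + threadIdx.x;
    const int c = blockIdx.y * blockDim.y + threadIdx.y;
    if (r >= rows || c >= cols) return;
    const int prows = rows + 2 * (siz / 2);  // rows of the padded image
    z[c * rows + r] = sqi_pixel(padx, filt, prows, r, c, siz);
}

// Runs a small synthetic case on GPU and CPU; returns the largest difference.
double run_check()
{
    const int rows = 20, cols = 12, siz = 5, pad = siz / 2;
    const int prows = rows + 2 * pad, pcols = cols + 2 * pad;
    std::vector<double> padx(prows * pcols), filt(siz * siz), z(rows * cols);
    for (int k = 0; k < prows * pcols; ++k) padx[k] = (k * 37) % 256;  // fake pixels
    for (int j = 0; j < siz; ++j)
        for (int i = 0; i < siz; ++i)
            filt[j * siz + i] = exp(-0.5 * ((i - pad) * (i - pad) + (j - pad) * (j - pad)));

    double *dz, *dx, *df;
    cudaMalloc(&dz, z.size() * sizeof(double));
    cudaMalloc(&dx, padx.size() * sizeof(double));
    cudaMalloc(&df, filt.size() * sizeof(double));
    cudaMemcpy(dx, padx.data(), padx.size() * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(df, filt.data(), filt.size() * sizeof(double), cudaMemcpyHostToDevice);

    const dim3 block(16, 16);
    const dim3 grid((rows + 15) / 16, (cols + 15) / 16);
    sqi_filter<<<grid, block>>>(dz, dx, df, rows, cols, siz);  // a single launch

    // A single transfer back, instead of one per pixel.
    const cudaError_t err = cudaMemcpy(z.data(), dz, z.size() * sizeof(double),
                                       cudaMemcpyDeviceToHost);
    cudaFree(dz); cudaFree(dx); cudaFree(df);
    if (err != cudaSuccess) return INFINITY;

    double max_diff = 0.0;
    for (int c = 0; c < cols; ++c)
        for (int r = 0; r < rows; ++r)
            max_diff = fmax(max_diff, fabs(z[c * rows + r] -
                            sqi_pixel(padx.data(), filt.data(), prows, r, c, siz)));
    return max_diff;
}

int main()
{
    printf("max |GPU - CPU| = %g\n", run_check());
    return 0;
}
```

From MATLAB, a kernel like this could presumably be loaded with parallel.gpu.CUDAKernel as in the question, with ThreadBlockSize set to [16 16 1] and GridSize to [ceil(a/16) ceil(b/16) 1], then called once through feval with a gpuArray of zeros(a,b) as the first argument, followed by a single gather of the result. The point of the design is that the image goes to the GPU once, every window is processed in the same launch, and only the finished Z comes back.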
