I wrote some MATLAB code that computes the Self Quotient Image (SQI), and now I want to rewrite part of it to run in parallel for a speedup.
This is the part in question:
siz=15;
X=normalize8(X);
[a,b]=size(X);
filt = fspecial('gaussian',[siz siz],sigma);
padsize = floor(siz/2);
padX = padarray(X,[padsize, padsize],'symmetric','both');
t0 = tic; % -------------------------------------------------------------
Z=zeros(a,b);
for i=padsize+1:a+padsize
    for j=padsize+1:b+padsize
        region = padX(i-padsize:i+padsize, j-padsize:j+padsize);
        means= mean(region(:));
        M=return_step(region, means);
        filt1=filt.*M;
        summ=sum(sum(filt1));
        filt1=(filt1/summ);
        Z(i-padsize,j-padsize)=(sum(sum(filt1.*region))/(siz*siz));
    end
end
toc(t0) % -------------------------------------------------------------
And this is the return_step function:
function M=return_step(X, means)
[a,b]=size(X);
M=zeros(a,b); % preallocate so M always has the same size as X
for i=1:a
    for j=1:b
        if X(i,j)>=means
            M(i,j)=1;
        end
    end
end
I wrote the kernel function below:
__global__ void returnstep(const double* x, double* m, double* filt, int leng, double mean, int i, int j, int width)
{
    int idx=threadIdx.y*blockDim.x+threadIdx.x;
    if(idx>=leng) return;
    int ridx= (j+threadIdx.y)*width+threadIdx.x+i;
    double xval= x[ridx];
    if (xval>=mean) m[idx]=filt[idx]*xval;
    else m[idx]=0;
}
and then changed the MATLAB code as follows:
kernel= parallel.gpu.CUDAKernel('returnstep.ptx', 'returnstep.cu');
kernel.ThreadBlockSize= [double(siz) double(siz) 1];
GM = gpuArray(zeros(siz,siz));
GpadX = gpuArray(padX);
Gfilt = gpuArray(filt);
%% Process image
t0 = tic; % -------------------------------------------------------------
Z=zeros(a,b);
for i=padsize+1:a+padsize
    for j=padsize+1:b+padsize
        region = padX(i-padsize:i+padsize, j-padsize:j+padsize); % same window as in the sequential version
        means= mean(region(:));
        GM= feval(kernel, GpadX, GM, Gfilt, siz*siz, means, i-padsize-1, j-padsize-1, padXwidth);
        filt1= gather(GM);
        summ=sum(sum(filt1));
        filt1=(filt1/summ);
        Z(i-padsize,j-padsize)=(sum(sum(filt1))/(siz*siz));
    end
end
toc(t0) % -------------------------------------------------------------
My sequential code runs in 2.5 s for a 330x200 image, but the new parallel code takes 15 s, and I don't know why.
I need some advice on improving it. I am new to CUDA programming.
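filt1 = gather(GM) copies GM from the GPU to the CPU on every iteration, which is very inefficient. For a 330x200 image that means 66,000 kernel launches and 66,000 transfers, each for only 15x15 = 225 elements, so the per-call overhead far outweighs the useful work. You should move the entire body of the loop nest to the GPU or, preferably, move the whole loop nest into a single GPU kernel. Otherwise you can forget about any speedup.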
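To illustrate what moving the whole loop nest into one kernel could look like, here is a minimal, untested sketch rather than a drop-in replacement. The kernel name sqi_smooth, its argument order, and the guard against a zero weight sum are my own assumptions. It follows the sequential MATLAB loop body (not the returnstep kernel) and assumes column-major double arrays as MATLAB stores them, with padX having a + 2*floor(siz/2) rows.

```cuda
// sqi_smooth.cu (hypothetical file and kernel name, not from the original post).
// One thread computes one output pixel Z(r, c), so the two MATLAB loops over
// i and j become the thread grid. All arrays are column-major doubles.
__global__ void sqi_smooth(double* Z, const double* padX, const double* filt,
                           int a, int b, int siz)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;  // output row, 0-based
    int c = blockIdx.y * blockDim.y + threadIdx.y;  // output column, 0-based
    if (r >= a || c >= b) return;                   // threads outside the image do nothing

    int pad   = siz / 2;
    int prows = a + 2 * pad;                        // rows of padX = column stride

    // Pass 1: mean of the siz x siz window whose top-left corner is padX(r, c).
    double sum = 0.0;
    for (int j = 0; j < siz; ++j)
        for (int i = 0; i < siz; ++i)
            sum += padX[(c + j) * prows + (r + i)];
    double mean = sum / (double)(siz * siz);

    // Pass 2: keep only the filter weights where the pixel is >= mean
    // (the role of return_step), then form the normalized weighted sum.
    double wsum = 0.0;  // sum(sum(filt .* M))
    double acc  = 0.0;  // sum(sum(filt .* M .* region))
    for (int j = 0; j < siz; ++j) {
        for (int i = 0; i < siz; ++i) {
            double v = padX[(c + j) * prows + (r + i)];
            if (v >= mean) {
                double w = filt[j * siz + i];
                wsum += w;
                acc  += w * v;
            }
        }
    }

    // Same expression as the sequential code: sum(filt1 .* region) / (siz*siz).
    // Assumption: write 0 if no weight was selected, to avoid dividing by zero.
    Z[c * a + r] = (wsum > 0.0) ? acc / wsum / (double)(siz * siz) : 0.0;
}
```

On the MATLAB side this would be launched once instead of once per pixel: create the kernel with parallel.gpu.CUDAKernel, set ThreadBlockSize to something like [16 16 1] and GridSize to [ceil(a/16) ceil(b/16)], call feval a single time with a gpuArray of zeros(a,b) for Z plus GpadX, Gfilt, a, b and siz, and gather the result only once at the end. Each thread reads its 15x15 window straight from global memory, which is simple but not optimal; staging tiles in shared memory would be a later refinement.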