I am writing a CUDA kernel for Histogram on a picture, but I had no idea how to return a array from the kernel, and the array will change when other thread read it. Any possible solution for it?
__global__ void Hist(
TColor *dst, //input image
int imageW,
int imageH,
int*data
){
const int ix = blockDim.x * blockIdx.x + threadIdx.x;
const int iy = blockDim.y * blockIdx.y + threadIdx.y;
if(ix < imageW && iy < imageH)
{
int pixel = get_red(dst[imageW * (iy) + (ix)]);
//this assign specific RED value of image to pixel
data[pixel] ++; // ?? problem statement ...
}
}
@para d_dst: input image TColor is equals to float4.
@para data: the array for histogram size [255]
extern "C" void
cuda_Hist(TColor *d_dst, int imageW, int imageH,int* data)
{
dim3 threads(BLOCKDIM_X, BLOCKDIM_Y);
dim3 grid(iDivUp(imageW, BLOCKDIM_X), iDivUp(imageH, BLOCKDIM_Y));
Hist<<<grid, threads>>>(d_dst, imageW, imageH, data);
}
Have you looked at the SDK sample? The “histogram” sample is available in the CUDA SDK (currently version 3.0 on the NVIDIA developer site, version 3.1 beta available for registered developers).
The documentation with the sample explains nicely how to handle your summation, either using global memory atomics on the GPU or by collecting the results for each block separately and then doing a separate reduction (either on the host or the GPU).