Working on a filter following, I am having a problem of doing these pieces of codes for processing an image in GPU:
for(int h=0; h<height; h++) {
for(int w=1; w<width; w++) {
image[h][w] = (1-a)*image[h][w] + a*image[h][w-1];
}
}
If I define:
dim3 threads_perblock(32, 32)
then each block I have: 32 threads can be communicated. The threads of this block can not communicate with the threads from other blocks.
Within a thread_block, I can translate that pieces of code using shared_memory however, for edge (I would say): image[0,31] and image[0,32] in different threadblocks. The image[0,31] should get value from image[0,32] to calculate its value. But they are in different threadblocks.
so that is the problem.
How would I solve this?
Thanks in advance.
If
imageis in global memory then there is no problem – you don’t need to use shared memory and you can just access pixels directly fromimagewithout any problem.However if you have already done some processing prior to this, and a block of
imageis already in shared memory, then you have a problem, since you need to do neighbourhood operations which are outside the range of your block. You can do one of the following – either:or:
Unfortunately neighbourhood operations can be really tricky in CUDA and there is always a down-side whatever method you use to handle edge cases.