I have to write a GPU implementation (OpenCL) of an image mapping.
I seem to remember reading somewhere that forward mapping is better suited to a parallel implementation. Why is that?
Also, does anyone have example code showing how to do these mappings (preferably on the GPU)?
To me the intuitive choice for a parallel implementation would be an inverse, not forward, mapping.
Consider cases where several source pixels map to a single destination pixel. In forward mapping, if each source pixel is processed by its own work-item, you have to implement some kind of synchronization on the destination pixel to coordinate the multiple writes. In inverse mapping there is no synchronization overhead: each work-item handles one destination pixel, so exactly one work-item writes to each pixel.
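To make the write conflict concrete, here is a rough sketch of what a forward-mapping kernel could look like. Everything specific in it is an assumption for illustration: 8-bit grayscale data in plain buffers rather than images (the atomic functions operate on buffer memory), a hypothetical downscaling mapping, and accumulator buffers dst_sum and dst_count that the host has zeroed beforehand. It relies on the 32-bit atomic functions available from OpenCL 1.1.

```c
// Forward mapping sketch: one work-item per SOURCE pixel.
// Several source pixels can land on the same destination pixel,
// so the accumulation has to be atomic.
__kernel void forward_mapping_accumulate(__global const uchar *src,
                                         __global uint *dst_sum,    // zeroed by the host
                                         __global uint *dst_count,  // zeroed by the host
                                         int src_w, int src_h,
                                         int dst_w, int dst_h)
{
    const int sx = (int)get_global_id(0);
    const int sy = (int)get_global_id(1);
    if (sx >= src_w || sy >= src_h)
        return;

    // Hypothetical mapping: scale source coordinates down to the destination grid.
    const int dx = sx * dst_w / src_w;
    const int dy = sy * dst_h / src_h;
    const int di = dy * dst_w + dx;

    // Synchronized writes: the cost that inverse mapping avoids.
    atomic_add(&dst_sum[di], (uint)src[sy * src_w + sx]);
    atomic_inc(&dst_count[di]);
}
```

A second pass (not shown) would divide each dst_sum entry by its dst_count entry to get the final pixel value, and would also have to deal with destination pixels that no source pixel reached.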
Example inverse-mapping kernel code, leveraging OpenCL’s image2d_t and sampler_t concepts for image manipulation:
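A minimal sketch of such a kernel follows. It assumes source and destination images of the same size with a float-readable channel format, and uses a rotation about the image center as a stand-in mapping. The names inverse_map, angle and src_sampler are illustrative, not part of any API.

```c
// Sampler: unnormalized pixel coordinates, reads outside the image return
// the border color, bilinear interpolation between neighboring pixels.
__constant sampler_t src_sampler = CLK_NORMALIZED_COORDS_FALSE |
                                   CLK_ADDRESS_CLAMP |
                                   CLK_FILTER_LINEAR;

// Hypothetical mapping: given a destination position, return the source
// position it comes from (the inverse of a rotation by `angle` radians).
float2 inverse_map(float2 dst_pos, float2 center, float angle)
{
    const float c = cos(angle);
    const float s = sin(angle);
    const float2 d = dst_pos - center;
    return (float2)(c * d.x + s * d.y, -s * d.x + c * d.y) + center;
}

// Inverse mapping: one work-item per DESTINATION pixel.
__kernel void inverse_mapping(__read_only image2d_t src,
                              __write_only image2d_t dst,
                              float angle)
{
    const int2 pos = (int2)((int)get_global_id(0), (int)get_global_id(1));
    const int2 dim = get_image_dim(dst);
    if (pos.x >= dim.x || pos.y >= dim.y)
        return;  // global size may be rounded up past the image edge

    const float2 center = convert_float2(dim) * 0.5f;

    // +0.5f addresses the center of the pixel in unnormalized coordinates.
    const float2 src_pos = inverse_map(convert_float2(pos) + 0.5f, center, angle);

    // The sampler handles interpolation and out-of-range coordinates.
    const float4 pixel = read_imagef(src, src_sampler, src_pos);

    // Exactly one work-item writes this destination pixel: no synchronization.
    write_imagef(dst, pos, pixel);
}
```

Launch it with a two-dimensional global size that covers the destination image. To use a different transform, only inverse_map needs to change.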
Of course, there are exceptions where forward mapping might be preferable. For example, if you had a very large source image and a small destination image, forward mapping would allow you to split the source image into segments and divide them amongst work-items or work-groups, with the segment data cached in the __private or __local address spaces. Without prior knowledge of the mapping function, an inverse mapping might need to access any part of the source image, which potentially restricts you to __global memory.
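As a rough illustration of that segment-based approach, the sketch below has each work-group cache one source tile in __local memory and reduce it to a single destination pixel. The assumptions are loud ones: 8-bit grayscale buffers, a fixed 16x downscale as the mapping, image dimensions that are exact multiples of TILE, and a work-group size of exactly TILE x TILE. The kernel name and TILE are made up for this example.

```c
#define TILE 16  // assumed tile edge; the work-group must be TILE x TILE

// Source-driven (forward) sketch: each work-group owns one source tile and
// is the only writer of the destination pixel that tile maps to.
__kernel __attribute__((reqd_work_group_size(TILE, TILE, 1)))
void downscale_tile(__global const uchar *src,
                    __global uchar *dst,
                    int src_w)  // assumed to be a multiple of TILE
{
    __local uint tile[TILE * TILE];

    const int lx  = (int)get_local_id(0);
    const int ly  = (int)get_local_id(1);
    const int lid = ly * TILE + lx;
    const int sx  = (int)get_global_id(0);
    const int sy  = (int)get_global_id(1);

    // Each work-item copies one source pixel of the segment into local memory.
    tile[lid] = src[sy * src_w + sx];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction: sum the tile entirely within local memory.
    for (int stride = TILE * TILE / 2; stride > 0; stride >>= 1) {
        if (lid < stride)
            tile[lid] += tile[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);  // reached by every work-item in the group
    }

    // One work-item per group writes the averaged destination pixel.
    if (lid == 0) {
        const int dst_w = src_w / TILE;
        const int di = (int)get_group_id(1) * dst_w + (int)get_group_id(0);
        dst[di] = (uchar)(tile[0] / (TILE * TILE));
    }
}
```

This only works because the mapping is known in advance, so each source segment can be assigned to a work-group that owns its destination pixel outright. For an arbitrary mapping, that ownership is not guaranteed and the synchronization problem returns.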