I was reading Supercomputing for the Masses: Part 5 on Dr. Dobb’s, and I have a question about the author’s code for (fast) array reversal.
I understand the need to use shared memory in general, but I don’t see where the performance gain in reverseArray_multiblock_fast.cu comes from.
In reverseArray_multiblock_fast.cu, each array element is transferred from global memory to shared memory, and then from shared memory back to global memory. I cannot understand why this is better than reading an array element directly from global memory and writing it to another index in global memory.
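To make the comparison concrete, this is roughly the direct version I have in mind. It is only a sketch: the kernel name reverseArrayDirect, the array size, and the block size are my own choices and are not taken from the article.

```cuda
#include <cstdio>
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical "direct" kernel (not from the article): each thread reads one
// element from global memory and writes it straight to the mirrored index.
__global__ void reverseArrayDirect(const int *d_in, int *d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[n - 1 - i] = d_in[i];  // in-order read, reverse-order write
}

int main()
{
    const int n = 256 * 1024;        // assumed array size
    const int threads = 256;         // assumed block size
    const int blocks = (n + threads - 1) / threads;
    const size_t bytes = n * sizeof(int);

    std::vector<int> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i)
        h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    reverseArrayDirect<<<blocks, threads>>>(d_in, d_out, n);
    assert(cudaGetLastError() == cudaSuccess);

    // This copy also waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    // Check that the output is the input reversed.
    for (int i = 0; i < n; ++i)
        assert(h_out[i] == h_in[n - 1 - i]);

    printf("Correct!\n");
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

This touches global memory exactly twice per element, with no shared memory at all, which is why I would have expected it to be at least as fast.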
Could you please explain this to me?
Check out Supercomputing for the Masses: Part 6.
It explains everything…
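In short, as far as I understand it, the gain comes from memory coalescing rather than from shared memory being "closer". If each thread writes its element directly to the mirrored global index, consecutive threads write to descending addresses, and on early CUDA hardware (compute capability 1.0/1.1) such an access is not coalesced, so it gets split into many separate memory transactions. If the block is reversed in shared memory first, both the global read and the global write can be done in ascending thread order. On newer GPUs the difference is likely to be much smaller. Below is a minimal sketch in the spirit of the article's kernel, not a copy of it; the name reverseArrayShared and the sizes are my own assumptions, and it assumes the array length is a multiple of the block size.

```cuda
#include <cstdio>
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Sketch only (assumed names): reverse each block's chunk in shared memory,
// then write the chunk to the mirrored block position in ascending order.
// Assumes the array length is exactly gridDim.x * blockDim.x.
__global__ void reverseArrayShared(const int *d_in, int *d_out)
{
    extern __shared__ int s_data[];

    // In-order global read; the reversal happens on the shared-memory store.
    int in = blockIdx.x * blockDim.x + threadIdx.x;
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];

    // Every element of the chunk must be stored before any thread reads it.
    __syncthreads();

    // In-order global write into the block mirrored across the array.
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    d_out[outOffset + threadIdx.x] = s_data[threadIdx.x];
}

int main()
{
    const int threads = 256;             // assumed block size
    const int blocks = 1024;             // assumed grid size
    const int n = threads * blocks;      // multiple of the block size
    const size_t bytes = n * sizeof(int);

    std::vector<int> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i)
        h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory per block.
    reverseArrayShared<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out);
    assert(cudaGetLastError() == cudaSuccess);

    // This copy also waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    // Check that the output is the input reversed.
    for (int i = 0; i < n; ++i)
        assert(h_out[i] == h_in[n - 1 - i]);

    printf("Correct!\n");
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The extra trip through shared memory is cheap compared with an uncoalesced global write, so on that older hardware the two-step version can come out ahead even though it moves each element twice. Timing both variants with the profiler on your own card is the only way to know how much it matters there.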