In my app each thread needs its own matrix of data. Let's say I have T threads, and each thread works with a different matrix D[M][N].
My question: how should I organize the data structure?
My solution: I define an array A of T*M*N elements. To avoid bank conflicts, I first store element D[0][0] once for each of the T threads, then D[0][1] … D[0][N-1], D[1][0] and so on (if you view this array as a matrix with M*N rows and T columns, you have one column per thread). This way the same element for different threads falls in different memory banks. Accordingly, I access element D[i][j] for thread x in the following way: D[i][j](x) == A[T * (N * i + j) + x].
My problem: it's computationally expensive to calculate such complicated indices.
P.S. I have an Nvidia Tesla C2075 (compute capability 2.0).
You say that M and N can be a few hundred. In that case you won't be able to use shared memory much (if at all). You should watch global memory consumption carefully too (although the Tesla has a lot of memory)!
200×200 elements × 3584 threads (the bare minimum for the C2075, in my opinion) × sizeof(int) comes to about 573 MB (547 MiB) of data.
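To make the arithmetic above easy to redo for other sizes, here is a minimal host-side sketch. The function names are made up for illustration, and the 4-byte element size assumes sizeof(int) == 4, as on CUDA targets.

```cpp
#include <cassert>
#include <cstddef>

// Total bytes needed when each of `threads` threads owns its own
// m x n matrix of elements that are `elem_size` bytes each.
// (Hypothetical helper, not part of any CUDA API.)
inline std::size_t footprint_bytes(std::size_t m, std::size_t n,
                                   std::size_t threads, std::size_t elem_size) {
    assert(elem_size > 0);
    return m * n * threads * elem_size;
}

// The same figure expressed in MiB (1 MiB = 1024 * 1024 bytes).
inline double footprint_mib(std::size_t m, std::size_t n,
                            std::size_t threads, std::size_t elem_size) {
    return static_cast<double>(footprint_bytes(m, n, threads, elem_size)) /
           (1024.0 * 1024.0);
}
```

For the numbers in this post, footprint_bytes(200, 200, 3584, 4) gives 573440000 bytes, which is 546.875 MiB.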
Global memory access patterns work differently from shared memory banks. Global memory is divided into segments of 32, 64 and 128 B. The cost of a read is approximately the number of distinct segments accessed per warp. In short, it usually boils down to this: the more spread out your accesses are, the worse the performance.
So, unless every thread accesses its own matrix at the same index (at least most of the time), no memory organization will be efficient. However, if that does hold, the layout you are describing may work well, because neighbouring threads then read neighbouring addresses.
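To illustrate the point, here is a rough host-side sketch that counts how many distinct segments one warp touches when all 32 threads read the same element (i, j) of their own matrix. It is a simplified model only: it assumes 4-byte elements, an array base aligned to the segment size, and the function names are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Interleaved layout from the question: element (i, j) of thread x,
// with T threads and N columns per matrix.
inline std::size_t interleaved_index(std::size_t T, std::size_t N,
                                     std::size_t i, std::size_t j, std::size_t x) {
    assert(j < N && x < T);
    return T * (N * i + j) + x;
}

// Alternative layout: each thread's M x N matrix stored contiguously.
inline std::size_t contiguous_index(std::size_t M, std::size_t N,
                                    std::size_t i, std::size_t j, std::size_t x) {
    assert(i < M && j < N);
    return x * (M * N) + N * i + j;
}

// Number of distinct segments of `segment_bytes` touched when threads 0..31
// each read one 4-byte element at the index given by `index_of_thread(x)`.
template <typename IndexFn>
std::size_t segments_per_warp(IndexFn index_of_thread, std::size_t segment_bytes) {
    std::set<std::size_t> segments;
    for (std::size_t x = 0; x < 32; ++x) {
        segments.insert(index_of_thread(x) * 4 / segment_bytes);
    }
    return segments.size();
}
```

With T = 3584 and M = N = 200, the interleaved layout puts the whole warp into a single 128 B segment, while the contiguous layout spreads it over 32 segments, one per thread.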
Also, if you have spread-out access patterns, disabling L1 caching for global loads may help (for example with the nvcc option -Xptxas -dlcm=cg). This is because an L1 cache line is 128 B, while L2 is accessed in 32 B segments, so you may reduce over-fetching. At least try it 🙂
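The over-fetching argument can be put into numbers with a small host-side model. This is only a sketch: it assumes each thread of a warp reads one 4-byte value and that memory is always fetched in whole units of the given granularity. The helper names are hypothetical.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Bytes transferred to serve one warp's loads when memory is fetched in
// units of `granularity` bytes: distinct units touched times the unit size.
inline std::size_t bytes_fetched(const std::vector<std::size_t>& byte_addresses,
                                 std::size_t granularity) {
    std::set<std::size_t> units;
    for (std::size_t address : byte_addresses) {
        units.insert(address / granularity);
    }
    return units.size() * granularity;
}

// Hypothetical access pattern: 32 threads, consecutive threads
// `stride_bytes` apart, starting at byte address 0.
inline std::vector<std::size_t> strided_addresses(std::size_t stride_bytes) {
    std::vector<std::size_t> addresses;
    for (std::size_t x = 0; x < 32; ++x) {
        addresses.push_back(x * stride_bytes);
    }
    return addresses;
}
```

With a stride of 160000 bytes (one 200×200 int matrix per thread), the warp needs only 128 useful bytes, but this model fetches 4096 bytes at 128 B granularity and 1024 bytes at 32 B granularity. For a fully coalesced stride of 4 bytes both granularities fetch 128 bytes, so the smaller unit only pays off for scattered accesses.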
To ease the pain of accessing the array I would do something like this: