In my app each thread needs its own matrix of data. Let’s say, I

Question


Asked: June 15, 2026 (2026-06-15T10:50:45+00:00)


In my app each thread needs its own matrix of data. Let’s say I have T threads, and each thread works with a different matrix D[M][N].

My question: how should I organize the data structure?

My solution: I define an array A of T*M*N elements. To avoid bank conflicts, I first store element D[0][0] T times (once for each thread), then D[0][1] … D[0][N-1], D[1][0] and so on. If you view this array as a matrix with M*N rows and T columns, there is one column per thread. This way the same element for different threads lands in different memory banks. Accordingly, I access element D[i][j] for thread x as follows: D[i][j](x) == A[T * (N * i + j) + x].

My problem: computing these complicated indices is expensive.

P.S. I have an Nvidia Tesla C2075 (compute capability 2.0).


1 Answer

Editorial Team · Answer 1 · 2026-06-15T10:50:46+00:00

You say that M and N can be a few hundred. At that size you won’t be able to use shared memory much (if at all). Watch global memory consumption carefully too, even though Tesla cards have a lot of memory.
200 × 200 elements × 3584 threads (the bare minimum for a C2075, in my opinion) × sizeof(int) comes to 573,440,000 bytes, which is roughly 547 MiB of data.
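
The arithmetic behind that estimate can be checked with a small host-side sketch. The helper name footprintBytes is hypothetical and only used for illustration; the figures are the ones quoted above, assuming a 4-byte int.

```cpp
#include <cstdint>

// Bytes needed when every thread owns one rows x cols matrix of
// elemSize-byte elements. 64-bit arithmetic avoids overflow.
constexpr std::uint64_t footprintBytes(std::uint64_t threads, std::uint64_t rows,
                                       std::uint64_t cols, std::uint64_t elemSize) {
    return threads * rows * cols * elemSize;
}

// Figures from the estimate: 3584 threads, 200x200 matrices, 4-byte ints.
constexpr std::uint64_t kExampleBytes = footprintBytes(3584, 200, 200, 4);
static_assert(kExampleBytes == 573440000ULL, "3584 * 200 * 200 * 4 bytes");
// 573,440,000 bytes / 2^20 = 546.875 MiB, the ~547 MB quoted above.
```

Doubling the element size (for example, double instead of int) doubles the footprint, so it is worth running this kind of check before settling on M, N and the launch configuration.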

Global memory access patterns work differently from shared memory banks. Global memory is divided into segments of 32, 64 and 128 bytes. The cost of a read is approximately the number of distinct segments accessed per warp. In short: the more spread out your accesses are, the worse the performance.

So, unless every thread accesses its own matrix at the same index (at least most of the time), no memory organization will be efficient. If the threads do access the same index, then the layout you describe may work.
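
To make the segment-counting argument concrete, here is a rough host-side sketch that counts how many 128-byte segments one warp touches when all 32 lanes read the same matrix element. It is a simplified model rather than a description of the actual hardware: the function names are hypothetical, int is assumed to be 4 bytes, and the warp is assumed to start at thread 0.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Number of distinct segments touched when the 32 lanes of one warp each
// read the element at byte offset offsetOf(lane).
template <typename OffsetFn>
std::size_t segmentsTouched(OffsetFn offsetOf, std::size_t segmentBytes = 128) {
    std::set<std::size_t> segments;
    for (std::size_t lane = 0; lane < 32; ++lane) {
        segments.insert(offsetOf(lane) / segmentBytes);
    }
    return segments.size();
}

// Interleaved layout from the question: A[T * e + x], e = flattened element index.
std::size_t interleavedSegments(std::size_t totalThreads, std::size_t e) {
    return segmentsTouched([=](std::size_t lane) {
        return (totalThreads * e + lane) * sizeof(int);
    });
}

// One contiguous matrix per thread: A[x * elemsPerMatrix + e].
std::size_t perThreadSegments(std::size_t elemsPerMatrix, std::size_t e) {
    return segmentsTouched([=](std::size_t lane) {
        return (lane * elemsPerMatrix + e) * sizeof(int);
    });
}

// Worked example: 3584 threads, 200x200 ints per thread, element index 5.
inline void checkSegmentCounts() {
    assert(interleavedSegments(3584, 5) == 1);      // one coalesced 128-byte read
    assert(perThreadSegments(200 * 200, 5) == 32);  // one segment per lane
}
```

Under these assumptions the interleaved layout needs a single segment per warp-wide read, while the one-matrix-per-thread layout needs 32. The advantage disappears once the lanes read different element indices.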

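Also, if you have spread-out access patterns, disabling L1 caching for global loads may help (on Fermi, for example, by compiling with -Xptxas -dlcm=cg). This is because an L1 cache line is 128 bytes, while L2 is accessed in 32-byte segments, so you may reduce over-fetching. At least try it 🙂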

To ease the pain of accessing the array I would do something like this:

//let the kernel dimensions be known at compile time - you can save some computation and registers
//assuming one-dimensional kernels

static const int blockSize = ...; //equivalent to blockDim
static const int gridSize = ...; //equivalent to gridDim
static const int rowSize = blockSize * gridSize;

template <typename T, int M, int N>
class MyMatrix {
private:
  T* data; //flattened array in global memory
  int tid;
public:
  __device__ inline MyMatrix(T* dataIsHere) : data(dataIsHere) {
    tid = threadIdx.x+blockDim.x*blockIdx.x;
  }
  __device__ inline T& operator()(int x, int y) {
    //x in [0,M), y in [0,N); the same (x,y) of consecutive threads is adjacent in memory
    return data[(y*M+x)*rowSize+tid];
  }
};

//assuming the matrix size is 200x200 and consists of ints:

__global__ void myKernel(int* flattenedMatrices) {
  MyMatrix<int,200,200> matrix(flattenedMatrices);

  ...

  matrix(2,4) = .... // happily access the matrix for both loads and stores
}
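
As a sanity check on the index formula, the sketch below mirrors the accessor on the host and verifies that no two (thread, x, y) triples land in the same slot. HostMatrixView and layoutIsCollisionFree are hypothetical names. Unlike the device class above, the thread id and total thread count are passed in explicitly, since threadIdx and blockIdx do not exist on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side stand-in for the device accessor: same index formula,
// (y * M + x) * totalThreads + tid, with x in [0, M) and y in [0, N).
template <typename T, int M, int N>
class HostMatrixView {
public:
    HostMatrixView(T* data, int tid, int totalThreads)
        : data_(data), tid_(tid), totalThreads_(totalThreads) {}
    T& operator()(int x, int y) {
        return data_[static_cast<std::size_t>(y * M + x) * totalThreads_ + tid_];
    }
private:
    T* data_;
    int tid_;
    int totalThreads_;
};

// Writes a unique value through every (tid, x, y), then reads everything
// back. Any two triples sharing a slot would show up as a mismatch.
template <int M, int N>
bool layoutIsCollisionFree(int totalThreads) {
    std::vector<int> storage(static_cast<std::size_t>(M) * N * totalThreads, -1);
    for (int tid = 0; tid < totalThreads; ++tid) {
        HostMatrixView<int, M, N> view(storage.data(), tid, totalThreads);
        for (int y = 0; y < N; ++y)
            for (int x = 0; x < M; ++x)
                view(x, y) = (tid * N + y) * M + x;
    }
    for (int tid = 0; tid < totalThreads; ++tid) {
        HostMatrixView<int, M, N> view(storage.data(), tid, totalThreads);
        for (int y = 0; y < N; ++y)
            for (int x = 0; x < M; ++x)
                if (view(x, y) != (tid * N + y) * M + x) return false;
    }
    return true;
}

inline void checkLayout() {
    assert((layoutIsCollisionFree<3, 5>(8)));  // 3x5 matrices, 8 threads
}
```

The check holds for non-square sizes as long as x stays below M and y below N, which is the convention the accessor above relies on.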
