The simple program I've been trying to write takes input from the user to decide how large a matrix to multiply:
dd@cuda-Linux:~/Desktop/multi$ ./program
What is the rowSize of a? 33
What is the colSize of a? 33
What is the rowSize of b? 33
What is the colSize of b? 33
Would you like to write the results to a file?(y or n)
y
Creating the random numbers now
Writing Matrix A to file now...
Writing Matrix B to file now...
Starting it on the device
Writing Matrix C to file now...
Finish
However, the problem lies in my thread calculations. I can go up to a 32×32 matrix and it runs fine and gives me the correct results. However, once I run a 33×33, I get results like the following:
[Matrix A] x [Matrix B] = [Matrix C] (I linked to them instead of pasting several huge matrices into this post). With Matrix C you can see that halfway through it starts to write the wrong numbers. My graphics card has a limit of 1024 threads per block, which corresponds to a 32×32 matrix. Also, when I run a 100×100 matrix, Matrix C is all 0s.
Let mem_size_X be sizeof(float) * size_X, where size_X is height*width of the matrix. Right now the height and width have to be the same, thus 32×32. Also, the "block_size" is just the height, so with a 32×32 matrix the block size is 32.
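To make the arithmetic in the previous paragraph concrete, here is a minimal host-side sketch in plain C++. It assumes the 1024-threads-per-block limit quoted above; the names kMaxThreadsPerBlock, threadsPerBlock, sharedBytes and fitsInOneBlock are made up for illustration and do not appear in the program.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: the per-block thread limit quoted in the question.
const std::size_t kMaxThreadsPerBlock = 1024;

// Threads in one block for a launch with dim3 threads(blockSize, blockSize).
std::size_t threadsPerBlock(std::size_t blockSize) {
    return blockSize * blockSize;
}

// Bytes of dynamic shared memory requested by the launch:
// 2 * block_size * block_size * sizeof(float), one tile of A plus one tile of B.
std::size_t sharedBytes(std::size_t blockSize) {
    return 2 * blockSize * blockSize * sizeof(float);
}

// True when a blockSize x blockSize block stays within the thread limit.
bool fitsInOneBlock(std::size_t blockSize) {
    assert(blockSize > 0);
    return threadsPerBlock(blockSize) <= kMaxThreadsPerBlock;
}
```

With block_size set to the matrix height, a 32×32 matrix asks for exactly 1024 threads per block, a 33×33 matrix asks for 1089, and a 100×100 matrix asks for 10000.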
Host code(launching):
float* deviceMatrixA;
float* deviceMatrixB;
cudaMalloc((void**) &deviceMatrixA, mem_size_A);//allocate mem_size_x on the device.
cudaMalloc((void**) &deviceMatrixB, mem_size_B);
cudaMemcpy(deviceMatrixA, a.elements, mem_size_A, cudaMemcpyHostToDevice);
cudaMemcpy(deviceMatrixB, b.elements, mem_size_B, cudaMemcpyHostToDevice);
int size_C = c.rowSize * c.colSize;
int mem_size_C = sizeof(float) * size_C;
c.elements = (float*) malloc(mem_size_C);
float* deviceMatrixC;
cudaMalloc((void**) &deviceMatrixC, mem_size_C);
dim3 threads(block_size, block_size);
dim3 grid(c.colSize / threads.x, c.rowSize / threads.y);
matrixMul<<< grid, threads,2*block_size*block_size*sizeof(float)>>>(deviceMatrixC, deviceMatrixA, deviceMatrixB, a.colSize, b.colSize, block_size);//sizeof(float)*block_size*block_size
cudaThreadSynchronize();
The kernel code:
// CUDA Kernel
__global__ void matrixMul(float* C, float* A, float* B, int wA, int wB, size_t block_size)
{
    // Block and thread indices
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    // First and last index of the strip of A handled by this block, and the tile stride
    int aBegin = wA * block_size * by;
    int aEnd = aBegin + wA - 1;
    int aStep = block_size;
    // First index of the strip of B handled by this block, and the tile stride
    int bBegin = block_size * bx;
    int bStep = block_size * wB;
    float Csub = 0;
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep)
    {
        // Dynamic shared memory: the first block_size*block_size floats hold
        // the tile of A, the next block_size*block_size floats the tile of B
        extern __shared__ float smem[];
        smem[ty*block_size+tx] = A[a + wA * ty + tx];
        smem[block_size*block_size+ty*block_size+tx] = B[b + wB * ty + tx];
        __syncthreads();
        // Multiply the two tiles; each thread accumulates one element of C
        for (int k = 0; k < block_size; ++k)
            Csub += smem[ty*block_size+k] * smem[block_size*block_size+k*block_size+tx];
        __syncthreads();
    }
    // Write this thread's element of the block's sub-matrix to global memory
    int c = wB * block_size * by + block_size * bx;
    C[c + wB * ty + tx] = Csub;
}
Thanks
As I told you on your earlier, almost identical question, this matrix multiply code is only designed to do calculations on matrices whose dimensions are a round multiple of block_size. If you choose block_size=32, then it can only be used for 32×32, 64×64, 96×96, 128×128, etc. Nothing you have done with dynamically allocated shared memory changes this.
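A quick way to see the restriction on the host side is to look at the grid arithmetic in the posted launch code, where the grid dimensions come from an integer division. The sketch below is plain C++ and only illustrates that arithmetic; gridDimFloor, coveredExtent and isSupportedSize are hypothetical helper names, not part of the posted program.

```cpp
#include <cassert>

// Number of blocks along one dimension, computed the way the posted host
// code does it: integer division truncates any remainder.
int gridDimFloor(int n, int blockSize) {
    assert(n >= 0 && blockSize > 0);
    return n / blockSize;
}

// Rows (or columns) of C that the launched blocks actually cover.
int coveredExtent(int n, int blockSize) {
    return gridDimFloor(n, blockSize) * blockSize;
}

// True only when n is a positive round multiple of blockSize, which is
// the only case in which the grid covers the whole matrix.
bool isSupportedSize(int n, int blockSize) {
    return n > 0 && coveredExtent(n, blockSize) == n;
}
```

For example, with block_size=32 a 33×33 matrix gets a 1×1 grid that covers only 32 rows and columns, and a 16×16 matrix gets a 0×0 grid, while 64×64 is covered exactly.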
To verify that this is the case, let’s start with a complete, compilable repro case which will run your kernel, check whether it executed and compare its output to a simple reference calculation done on the host. This code is your posted kernel, plus the core of your launch parameter calculations. It will read a size from stdin and then run the case. If the results differ by more than a certain tolerance, an assert error will be raised. Here is the code, it should compile on CUDA 3.0 or later and run on any CUDA compatible GPU:
So now, let's run this code for different sizes. To verify that the code on the GPU didn't do anything wrong, I will run it using the cuda-memcheck utility, which can detect out of bounds memory access. All of the following tests were made on an OS X 10.6 machine with a compute capability 1.2 card and CUDA 3.2, using blocksize=16:
Let's try a case where the matrices are less than blocksize first:
Here we failed to run the kernel with an invalid configuration argument error. Why? Because of this:
which results in a 0 grid size when m,n < blocksize.
Next let's try the smallest round multiple of blocksize, in this case 16:
which runs without error, or assert failure. Let’s now increase the size to 17:
and we get out of bounds memory access detected and a launch failure error, which is expected. Now let's try 64, 96, and 128:
and finally let’s try 129:
Even if you don’t follow why the out of bounds errors are occurring, are you at least willing to accept that this code really does only work correctly for matrices which are round multiples of blocksize?
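For anyone who wants to see where the out of bounds accesses come from without a GPU, the kernel's global memory index arithmetic can be replayed on the host. This is only a sketch under stated assumptions: square matrices (wA == wB == n) and a grid of (n / blockSize) × (n / blockSize) blocks, as in the posted launch code. The function name countOutOfBoundsReads is made up for illustration.

```cpp
#include <cassert>

// Replays the posted kernel's index arithmetic for every block and thread
// and counts reads of A or B that land outside an n x n matrix.
// Assumptions: wA == wB == n, and grid = n / blockSize in both dimensions.
long countOutOfBoundsReads(int n, int blockSize) {
    assert(n > 0 && blockSize > 0);
    const long wA = n, wB = n;
    const long elems = static_cast<long>(n) * n;  // valid indices: 0 .. elems-1
    const int grid = n / blockSize;               // integer division truncates
    long bad = 0;
    for (int by = 0; by < grid; ++by) {
        for (int bx = 0; bx < grid; ++bx) {
            const long aBegin = wA * blockSize * by;
            const long aEnd = aBegin + wA - 1;
            long a = aBegin;
            long b = static_cast<long>(blockSize) * bx;
            // Same tile loop as the kernel: a += aStep, b += bStep
            for (; a <= aEnd; a += blockSize, b += blockSize * wB) {
                for (int ty = 0; ty < blockSize; ++ty) {
                    for (int tx = 0; tx < blockSize; ++tx) {
                        if (a + wA * ty + tx >= elems) ++bad;  // A[a + wA*ty + tx]
                        if (b + wB * ty + tx >= elems) ++bad;  // B[b + wB*ty + tx]
                    }
                }
            }
        }
    }
    return bad;
}
```

With blockSize=16 the count is zero for 16, 64, 96 and 128 and non-zero for 17 and 129: once n is not a round multiple, the tile loop runs one extra, partial iteration and the last tile of B is read past the end of the allocation. For n smaller than blockSize the function also returns zero, but only because the grid is empty and nothing would be launched at all.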