I am struck up with Matrix multiplication on CUDA. The resultant product matrix is

Question

0

Asked: May 20, 20262026-05-20T07:51:26+00:00 2026-05-20T07:51:26+00:00

I am struck up with Matrix multiplication on CUDA. The resultant product matrix is

0

I am struck up with Matrix multiplication on CUDA. The resultant product matrix is always zero. I have read some sample codes like matrix multiplication in cuda for resolving my problem, but all in vain.

Apart from erratic result of 0, the maximum size of “Width” (code below) is not even 512. I was not able to debug where the problem lies. May be we can discuss it on StackOverflow.

I am referring “Programming Massively Parallel Processors”

#include<cuda.h>
#include<stdio.h>

int main(void) {
    void MatrixMultiplication(float *, float *, float *, int);
    const int Width = 5;
    float M[Width*Width], N[Width*Width], P[Width*Width];
    for(int i = 0; i < (Width*Width) ; i++) {
        M[i] = 5;
        N[i] = 5;
        P[i] = 0;
    }
    MatrixMultiplication(M, N, P, Width);
    for(int i = 0; i < (Width*Width) ; i++) {
        printf("%d \n", P[i]);
    }
    int quit;
    scanf("%d",&quit);
    return 0;
}

//Matrix multiplication kernel - thread specification
__global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int Width) {
    //2D Thread ID
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    //Pvalue stores the Pd element that is computed by the thread
    float Pvalue = 0;

    for(int k = 0; k < Width ; ++k) {
        float Mdelement = Md[ty*Width + k];
        float Ndelement = Nd[k*Width + tx];
        Pvalue += (Mdelement*Ndelement);
    }

    Pd[ty*Width + tx] = Pvalue;
}

void MatrixMultiplication(float *M, float *N, float *P, int Width) {
    int size = Width*Width*sizeof(float);
    float *Md, *Nd, *Pd;

    //Transfer M and N to device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md,M,size,cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd,N,size,cudaMemcpyHostToDevice);

    //Allocate P on the device
    cudaMalloc((void**)&Pd,size);

    //Setup the execution configuration
    dim3 dimBlock(Width,Width);
    dim3 dimGrid(1,1);

    //Launch the device computation threads!
    MatrixMulKernel<<<dimGrid,dimBlock>>>(Md,Nd,Pd,Width);

    //Transfer P from device to host
    cudaMemcpy(P,Pd,size,cudaMemcpyDeviceToHost);

    //Free device matrices
    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);
}

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-20T07:51:27+00:00

I figured out what was wrong. Let’s analyze it :

Point 1 : The quest to remove the ever monotonic “zero value”

As noted, you must replace printf("%d \n", P[i]); as printf("%f \n", P[i]);

Point 2 : Why the program fails with a value of Width 512 ?

Actually it will fail for even a small value such as 23. Why ? Because 23*23 is > 512 (The maximum number of threads that a GPU can have per block as of today!)

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I am struck up with Matrix multiplication on CUDA. The resultant product matrix is

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply