CUDA matrix multiplication works for 30 x 30 but outputs zeros for larger sizes

Question

Asked: June 11, 2026

This is a program for matrix multiplication on the CUDA architecture.
It works fine when the matrices are 30 x 30, but the output is a series of 0's when the size is larger.
I am using a standard EC2 instance for CUDA hosted on a Linux machine. Can anybody figure out the reason?

#include <stdio.h>
#define SIZE 30

__global__ void matrix_multiply(float *input1,float  *input2,float *output,int dimension){


    int input1_index = threadIdx.x / dimension * dimension;
    int input2_index =  threadIdx.x % dimension;
    int i=0;
    for( i =0; i <dimension; i++){
        output[threadIdx.x] += input1[input1_index + i] * input2[input2_index + i * dimension];
    }
}
int main(){
    int i,j,natural_number=1;
    float input1[SIZE][SIZE],input2[SIZE][SIZE],result[SIZE][SIZE]={0};
    float  *c_input1,*c_input2,*c_result;
    for(i=0;i<SIZE;i++){
        for(j=0;j<SIZE;j++){
            input1[i][j]=input2[i][j]=natural_number++;
        }
    }
    cudaMalloc((void**)&c_input1,sizeof(input1));
    cudaMalloc((void**)&c_input2,sizeof(input2));
    cudaMalloc((void**)&c_result,sizeof(result));
    cudaMemcpy(c_input1,input1,sizeof(input1),cudaMemcpyHostToDevice);
    cudaMemcpy(c_input2,input2,sizeof(input2),cudaMemcpyHostToDevice);
    cudaMemcpy(c_result,result,sizeof(result),cudaMemcpyHostToDevice);

    matrix_multiply<<<1,SIZE * SIZE>>>(c_input1,c_input2,c_result,SIZE);
    cudaError_t err = cudaGetLastError(); // read once: the call resets the stored error
    if(err!=cudaSuccess){
        printf("%s\n",cudaGetErrorString(err));
    }
    cudaMemcpy(result,c_result,sizeof(result),cudaMemcpyDeviceToHost);
    for(i=0;i<SIZE;i++){
        for(j=0;j<SIZE;j++){
            printf("%.2f ",result[i][j]);
        }
        printf("\n");
    }
    cudaFree(c_input1);
    cudaFree(c_input2);
    cudaFree(c_result); 
    return 0;
}


1 Answer

Editorial Team · Answer 1 · 2026-06-11T00:26:51+00:00

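Your GPU probably has a maximum of 1024 threads per block. 30 x 30 = 900 threads, so that is fine, but e.g. 40 x 40 = 1600 threads exceeds the limit and results in a kernel launch failure. The kernel then never runs, so the result buffer keeps the zeros it was initialized with (take-home message: always check for errors!).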
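One way to confirm the limit is to ask the device for it and to inspect the launch status. The following is a minimal sketch, assuming device 0 is the GPU in use; the `CUDA_CHECK` macro and the `noop` kernel are hypothetical helpers written for this illustration, not part of the CUDA API.

```cuda
#include <stdio.h>
#include <stdlib.h>

// Hypothetical convenience macro (not part of CUDA): abort on any runtime error.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Hypothetical empty kernel, used only to test the launch configuration.
__global__ void noop(void) {}

int main(void) {
    const int size = 40;  // assumed matrix dimension for this illustration

    // Query the per-block thread limit of device 0 (assumed to be the GPU in use).
    cudaDeviceProp prop;
    CUDA_CHECK(cudaGetDeviceProperties(&prop, 0));
    printf("maxThreadsPerBlock = %d\n", prop.maxThreadsPerBlock);

    // Same launch shape as the question: one block of size * size threads.
    noop<<<1, size * size>>>();

    // cudaGetLastError() resets the stored error, so read it exactly once.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("launch with %d threads per block failed: %s\n",
               size * size, cudaGetErrorString(err));
    }

    CUDA_CHECK(cudaDeviceSynchronize());
    return 0;
}
```

Storing the result of `cudaGetLastError()` in a variable matters: a second call returns `cudaSuccess` because the first call has already cleared the error.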

You probably want to consider organizing your data differently, e.g. SIZE blocks of SIZE threads each, and then launch the kernel as:

matrix_multiply<<<SIZE, SIZE>>>(c_input1,c_input2,c_result,SIZE);

(Obviously you’ll need to modify your array indexing within the kernel code, e.g. use the block index as the row and the thread index as the column.)
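The indexing change could look like the sketch below, which uses the block index as the row and the thread index as the column. It is one possible arrangement, not necessarily what the answerer had in mind. The test size of 40, the small integer fill pattern, and the CPU comparison at the end are assumptions added for illustration; the fill pattern keeps every sum exactly representable in `float`, so the GPU and CPU results can be compared directly.

```cuda
#include <stdio.h>
#define SIZE 40  // assumed test size; this layout requires SIZE <= max threads per block

// One block per output row, one thread per output column.
__global__ void matrix_multiply(const float *input1, const float *input2,
                                float *output, int dimension) {
    int row = blockIdx.x;   // block index selects the row
    int col = threadIdx.x;  // thread index selects the column
    float sum = 0.0f;       // accumulate in a register, write to memory once
    for (int i = 0; i < dimension; i++) {
        sum += input1[row * dimension + i] * input2[i * dimension + col];
    }
    output[row * dimension + col] = sum;
}

static float input1[SIZE][SIZE], input2[SIZE][SIZE], result[SIZE][SIZE];

int main(void) {
    float *c_input1, *c_input2, *c_result;

    // Small integer values (assumed test data) keep all sums exact in float.
    for (int i = 0; i < SIZE; i++)
        for (int j = 0; j < SIZE; j++)
            input1[i][j] = input2[i][j] = (float)((i + j) % 7);

    cudaMalloc((void **)&c_input1, sizeof(input1));
    cudaMalloc((void **)&c_input2, sizeof(input2));
    cudaMalloc((void **)&c_result, sizeof(result));
    cudaMemcpy(c_input1, input1, sizeof(input1), cudaMemcpyHostToDevice);
    cudaMemcpy(c_input2, input2, sizeof(input2), cudaMemcpyHostToDevice);

    matrix_multiply<<<SIZE, SIZE>>>(c_input1, c_input2, c_result, SIZE);
    cudaError_t err = cudaGetLastError();  // read once; the call resets the error
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    err = cudaMemcpy(result, c_result, sizeof(result), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("copy back failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Compare against a plain CPU reference.
    int mismatches = 0;
    for (int i = 0; i < SIZE; i++) {
        for (int j = 0; j < SIZE; j++) {
            float expected = 0.0f;
            for (int k = 0; k < SIZE; k++)
                expected += input1[i][k] * input2[k][j];
            if (result[i][j] != expected) mismatches++;
        }
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(c_input1);
    cudaFree(c_input2);
    cudaFree(c_result);
    return mismatches != 0;
}
```

With this layout each block holds only SIZE threads, so the per-block limit is reached at SIZE = 1024 rather than at 32 x 32. Larger matrices would need a grid that also splits each row across several blocks.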
