The Archive Base

Editorial Team
Asked: May 29, 2026 (2026-05-29T17:59:43+00:00)

The idea of the simple program that I’ve been trying to write is to take input from the user that sets the size of the matrices to multiply.

dd@cuda-Linux:~/Desktop/multi$ ./program
What is the rowSize of a? 33
What is the colSize of a? 33
What is the rowSize of b? 33
What is the colSize of b? 33
Would you like to write the results to a file?(y or n)
y
Creating the random numbers now
Writing Matrix A to file now...
Writing Matrix B to file now...
Starting it on the device
Writing Matrix C to file now...
Finish

However, the problem lies in my thread calculations. A 32×32 matrix runs fine and gives me the correct results, but once I run a 33×33 I get results like the following:
[Matrix A] x [Matrix B] = [Matrix C] (I linked to them instead of pasting several huge matrices into this post). In Matrix C you can see that halfway through it starts to write the wrong numbers. My graphics card has a limit of 1024 threads per block, which corresponds to a 32×32 matrix. Also, when I run a 100×100 matrix, Matrix C is all 0s.

Let mem_size_X be sizeof(float) * size_X, where size_X is height*width of the matrix. Right now the height and width have to be the same, thus 32×32. Also, “block_size” is just the height, so for a 32×32 matrix the block size is 32.
Host code (launching):

    float* deviceMatrixA;
    float* deviceMatrixB;
    cudaMalloc((void**) &deviceMatrixA, mem_size_A);//allocate mem_size_x on the device. 
    cudaMalloc((void**) &deviceMatrixB, mem_size_B);


    cudaMemcpy(deviceMatrixA, a.elements, mem_size_A, cudaMemcpyHostToDevice);
    cudaMemcpy(deviceMatrixB, b.elements, mem_size_B, cudaMemcpyHostToDevice);



    int size_C = c.rowSize * c.colSize;
    int mem_size_C = sizeof(float) * size_C;
    c.elements = (float*) malloc(mem_size_C);


    float* deviceMatrixC;
    cudaMalloc((void**) &deviceMatrixC, mem_size_C);


    dim3 threads(block_size, block_size);
    dim3 grid(c.colSize / threads.x, c.rowSize / threads.y);



    matrixMul<<< grid, threads,2*block_size*block_size*sizeof(float)>>>(deviceMatrixC, deviceMatrixA, deviceMatrixB, a.colSize, b.colSize, block_size);//sizeof(float)*block_size*block_size
    cudaThreadSynchronize();

The kernel code:

// CUDA Kernel
__global__ void matrixMul( float* C, float* A, float* B, int wA, int wB,size_t block_size)
{
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    int aBegin = wA * block_size * by;
    int aEnd   = aBegin + wA - 1;
    int aStep  = block_size;

    int bBegin = block_size * bx;

    int bStep  = block_size * wB;
    float Csub=0;

    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) 
    {
        extern __shared__ float As[];
        extern __shared__ float Bs[];
        extern __shared__ float smem[];

        smem[ty*block_size+tx] = A[a + wA * ty + tx];

        smem[block_size*block_size+ty*block_size+tx]  = B[b + wB * ty + tx];

        __syncthreads();

        for (int k = 0; k < block_size; ++k)
            Csub += smem[ty*block_size+k] * smem[block_size*block_size+k*block_size+tx] ;

        __syncthreads();
    }

    int c = wB * block_size * by + block_size * bx;
    C[c + wB * ty + tx] = Csub;


}

Thanks


1 Answer

  1. Editorial Team
    Added an answer on May 29, 2026 at 5:59 pm (2026-05-29T17:59:45+00:00)

    As I told you in your earlier, almost identical question, this matrix multiply code is only designed to work on matrices whose dimensions are a round multiple of block_size. If you choose block_size=32, then it can only be used for 32×32, 64×64, 96×96, 128×128, etc. Nothing you have done with dynamically allocated shared memory changes this.

    To verify that this is the case, let’s start with a complete, compilable repro case which will run your kernel, check whether it executed, and compare its output to a simple reference calculation done on the host. This code is your posted kernel, plus the core of your launch parameter calculations. It reads a size from the command line (defaulting to 128) and then runs the case. If the results differ by more than a certain tolerance, an assert error will be raised. Here is the code; it should compile on CUDA 3.0 or later and run on any CUDA compatible GPU:

    #include <assert.h>
    #include <cstdio>
    #include <cstdlib>
    #include <cmath>
    #include <ctime>
    inline void GPUassert(cudaError_t code, const char * file, int line, bool Abort=true)
    {
        if (code != 0) {
            fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code),file,line);
            if (Abort) exit(code);
        }       
    }
    
    #define GPUerrchk(ans) { GPUassert((ans), __FILE__, __LINE__); }
    
    __global__ void matrixMul( float* C, float* A, float* B, int wA, int wB, size_t block_size)
    {
        int bx = blockIdx.x;
        int by = blockIdx.y;
        int tx = threadIdx.x;
        int ty = threadIdx.y;
    
        int aBegin = wA * block_size * by;
        int aEnd   = aBegin + wA - 1;
        int aStep  = block_size;
        int bBegin = block_size * bx;
        int bStep  = block_size * wB;
    
        float Csub=0.f;
        for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) 
        {
            extern __shared__ float smem[];
    
            smem[ty*block_size+tx] = A[a + wA * ty + tx];
            smem[block_size*block_size+ty*block_size+tx]  = B[b + wB * ty + tx];
    
            __syncthreads();
    
            for (int k = 0; k < block_size; ++k)
                Csub += smem[ty*block_size+k] * smem[block_size*block_size+k*block_size+tx] ;
    
            __syncthreads();
        }
    
        int c = wB * block_size * by + block_size * bx;
        C[c + wB * ty + tx] = Csub;
    }
    
    inline float frand(){
        return (float)rand()/(float)RAND_MAX;
    }
    
    void matmul(float *C, const float *A, const float *B,  int wA, int wB)
    {
        for(int k=0; k<wB; k++) {
            for(int j=0; j<wB; j++) {
                float dotp = 0.f;
                for(int i=0; i<wA; i++) {
                    dotp += A[j*wA+i] * B[i*wB+k];
                }
                C[j*wB+k] = dotp;
            }
        }
    }
    
    int main(int argc, char ** argv)
    {
        int val = 128;
    
        if ( argc == 2 ) {
            val = atoi(argv[1]);
        }
    
        int m = val, n = val, mn = m*n;
        size_t sz = size_t(mn) * sizeof(float);
    
        srand(time(NULL));
    
        float * A = new float[mn], * B = new float[mn], * C= new float[mn];
        float * A_, * B_, * C_;
    
        for(int i=0; i<mn; i++) {
            A[i] = frand(); B[i] = frand();
        }
    
        GPUerrchk( cudaMalloc((void **)&A_, sz) );
        GPUerrchk( cudaMalloc((void **)&B_, sz) );
        GPUerrchk( cudaMalloc((void **)&C_, sz) );
    
        GPUerrchk( cudaMemcpy(A_, A, sz, cudaMemcpyHostToDevice) );
        GPUerrchk( cudaMemcpy(B_, B, sz, cudaMemcpyHostToDevice) );
    
        // Launch configuration
        // Note that the input matrix sizes *must* be a round
        // multiple of blocksize for this code to work correctly.
        const int blocksize=16;
        const int shmsz = size_t(2*blocksize*blocksize) * sizeof(float);
        dim3 block=dim3(blocksize,blocksize), grid = dim3(m/block.x,m/block.y); 
    
        matrixMul<<<grid,block,shmsz>>>(C_,A_,B_,m,n,blocksize);
        GPUerrchk( cudaPeekAtLastError() );
    
        GPUerrchk( cudaMemcpy(C, C_, sz, cudaMemcpyDeviceToHost) );
    
        // Verification on host
        float * Cref = new float[mn];
        matmul(Cref,A,B,m,n); 
        const float tol = 5e-5f;
        for(int i=0; i<mn; i++) {
            assert(fabs(C[i]-Cref[i])/C[i] < tol);
        }
    
        GPUerrchk( cudaThreadExit() ); // CUDA 3.2 compatible
    
        return 0;
    }
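
    The verification loop above measures the error relative to C[i]. As a hedged alternative, the sketch below scales the tolerance by max(|ref|, 1) so that a zero never ends up in a denominator, and it treats NaN as a failure. allClose is a hypothetical helper, not part of the repro code, and it is plain C++ that only illustrates the comparison:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: true if every element of `got` is within `tol` of
// `ref`, measured relative to max(|ref|, 1) so that a zero reference
// value never causes a division by zero.
bool allClose(const float* got, const float* ref, std::size_t count, float tol) {
    assert(tol >= 0.0f);
    for (std::size_t i = 0; i < count; ++i) {
        const float scale = std::fmax(std::fabs(ref[i]), 1.0f);
        const float err = std::fabs(got[i] - ref[i]);
        // The negated comparison also rejects NaN, which compares false.
        if (!(err <= tol * scale)) return false;
    }
    return true;
}
```

    In the repro it could be called as allClose(C, Cref, mn, tol) in place of the per-element assert.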
    

    So now, let’s run this code for different sizes. To verify that the code on the GPU didn’t do anything wrong, I will run it using the cuda-memcheck utility, which can detect out of bounds memory access. All of the following tests were made on an OS X 10.6 machine with a compute capability 1.2 card and CUDA 3.2, using blocksize=16 :

    $ nvcc -arch=sm_12 -Xcompiler="-Wall" -Xptxas="-v" -o matmul2 matmul2.cu
    ptxas info    : Compiling entry function '_Z9matrixMulPfS_S_iim' for 'sm_12'
    ptxas info    : Used 16 registers, 32+16 bytes smem, 4 bytes cmem[1]
    

    First, let’s try a case where the matrices are smaller than blocksize:

    $ cuda-memcheck ./matmul2 4
    ========= CUDA-MEMCHECK
    GPUassert: invalid configuration argument matmul2.cu 101
    ========= ERROR SUMMARY: 0 errors
    

    Here the kernel failed to launch, with an invalid configuration argument error. Why? Because of this:

        dim3 block=dim3(blocksize,blocksize), grid = dim3(m/block.x,m/block.y); 
    

    which results in a grid size of 0 when m,n < blocksize, because the integer division truncates.
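
    The same truncation is what leaves the last partial block uncovered for sizes such as 17 or 33. As a rough sketch, the two ways of sizing the grid and a basic launch sanity check might look like the following. It is plain C++, the helper names are hypothetical, and the 1024 threads-per-block limit is taken from the question rather than queried from the device (real devices impose other limits as well):

```cpp
#include <cassert>

// Hypothetical helpers (not part of the posted code) for sizing a launch.

// Blocks per dimension as the posted code computes them: integer division
// truncates, so n < block gives 0 and n = 33 with block = 32 gives 1.
inline int blocksTruncated(int n, int block) {
    assert(n >= 0 && block > 0);
    return n / block;
}

// Ceiling division: enough blocks to cover all n elements, including a
// partially filled last block.
inline int blocksCovering(int n, int block) {
    assert(n >= 0 && block > 0);
    return (n + block - 1) / block;
}

// True if a block x block thread block on a gx x gy grid passes two basic
// checks: a non-empty grid and an assumed per-block thread limit.
inline bool launchable(int gx, int gy, int block, int maxThreads = 1024) {
    return gx > 0 && gy > 0 && block > 0 && block * block <= maxThreads;
}
```

    Note that rounding the grid up is only safe if the kernel also bounds-checks its loads and stores; with the kernel as posted, the extra blocks would add more out of bounds accesses. The last check also shows why a block_size of 100 cannot launch at all: 100*100 threads is far above 1024.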

    Next let’s try the smallest round multiple of blocksize, in this case 16:

    $ cuda-memcheck ./matmul2 16
    ========= CUDA-MEMCHECK
    ========= ERROR SUMMARY: 0 errors
    

    which runs without an error or an assert failure. Let’s now increase the size to 17:

    $ cuda-memcheck ./matmul2 17
    ========= CUDA-MEMCHECK
    GPUassert: unspecified launch failure matmul2.cu 103
    ========= Invalid __global__ read of size 4
    =========     at 0x000001f8 in matrixMul
    =========     by thread (0,2,0) in block (0,0)
    =========     Address 0x001009c8 is out of bounds
    =========
    ========= ERROR SUMMARY: 1 error
    

    and cuda-memcheck detects an out of bounds memory access, along with a launch failure error, which is expected. Now let’s try 64, 96, and 128:

    $ cuda-memcheck ./matmul2 64
    ========= CUDA-MEMCHECK
    ========= ERROR SUMMARY: 0 errors
    
    $ cuda-memcheck ./matmul2 96
    ========= CUDA-MEMCHECK
    ========= ERROR SUMMARY: 0 errors
    
    $ cuda-memcheck ./matmul2 128
    ========= CUDA-MEMCHECK
    ========= ERROR SUMMARY: 0 errors
    

    and finally let’s try 129:

    $ cuda-memcheck ./matmul2 129
    ========= CUDA-MEMCHECK
    GPUassert: unspecified launch failure matmul2.cu 103
    ========= Invalid __global__ read of size 4
    =========     at 0x000001f8 in matrixMul
    =========     by thread (0,1,0) in block (0,0)
    =========     Address 0x00120904 is out of bounds
    =========
    ========= ERROR SUMMARY: 1 error
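
    To see where these out of bounds reads come from, the kernel's index arithmetic can be replayed on the host. The sketch below is plain C++, assumes square n x n inputs (wA = wB = n) and the truncating grid from the repro, and uses a hypothetical function name. For n = 17 and blocksize = 16, the loop runs a second iteration with b = 16*17 = 272, and the thread with ty = 1, tx = 0 then reads B[272 + 17] = B[289], which is the first index past a 17*17 = 289 element array:

```cpp
#include <cassert>

// Host-side replay of the posted kernel's global-memory index arithmetic
// for square n x n inputs and a grid of (n / block) x (n / block) blocks.
// Returns how many element reads of A or B would land at or beyond index
// n * n, i.e. past the end of the allocation.
// countOutOfBoundsReads is a hypothetical name used only for this sketch.
long countOutOfBoundsReads(int n, int block) {
    assert(n > 0 && block > 0);
    const long total = static_cast<long>(n) * n;
    const int grid = n / block;  // truncating, as in the posted launch code
    long bad = 0;
    for (int by = 0; by < grid; ++by) {
        for (int bx = 0; bx < grid; ++bx) {
            const long aBegin = static_cast<long>(n) * block * by;
            const long aEnd = aBegin + n - 1;
            const long bStep = static_cast<long>(block) * n;
            long b = static_cast<long>(block) * bx;
            for (long a = aBegin; a <= aEnd; a += block, b += bStep) {
                for (int ty = 0; ty < block; ++ty) {
                    for (int tx = 0; tx < block; ++tx) {
                        const long offset = static_cast<long>(n) * ty + tx;
                        if (a + offset >= total) ++bad;  // read of A
                        if (b + offset >= total) ++bad;  // read of B
                    }
                }
            }
        }
    }
    return bad;
}
```

    Under these assumptions the count is zero for 16, 64, 96 and 128 and nonzero for 17 and 129, matching the pattern in the runs above. It is also nonzero for the 33×33 case with a block size of 32 from the question.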
    

    Even if you don’t follow why the out of bounds errors are occurring, are you at least willing to accept that this code really does only work correctly for matrices which are round multiples of blocksize?
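
    If the kernel does need to handle other sizes, one common approach is to round the grid up and guard every load and store. What follows is a host-side C++ emulation of that idea, not a CUDA kernel and not the posted code; all names are hypothetical. Tile elements that fall outside A or B are loaded as zero, which leaves the dot products unchanged, and only threads that map to a real element of C write a result:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of a guarded, tiled multiply C = A * B for row-major
// A (hA x wA) and B (wA x wB). Loops over (by, bx) and (ty, tx) stand in
// for CUDA blocks and threads; As and Bs stand in for shared memory.
std::vector<float> tiledMatMulGuarded(const std::vector<float>& A,
                                      const std::vector<float>& B,
                                      int hA, int wA, int wB, int block) {
    assert(block > 0 && hA > 0 && wA > 0 && wB > 0);
    assert(A.size() == static_cast<std::size_t>(hA) * wA);
    assert(B.size() == static_cast<std::size_t>(wA) * wB);
    std::vector<float> C(static_cast<std::size_t>(hA) * wB, 0.f);
    std::vector<float> As(block * block), Bs(block * block), acc(block * block);
    const int gridY = (hA + block - 1) / block;  // ceiling division, so a
    const int gridX = (wB + block - 1) / block;  // partial last tile is covered
    const int tiles = (wA + block - 1) / block;
    for (int by = 0; by < gridY; ++by) {
        for (int bx = 0; bx < gridX; ++bx) {
            acc.assign(acc.size(), 0.f);  // one accumulator (Csub) per thread
            for (int t = 0; t < tiles; ++t) {
                // Load phase: elements outside A or B are replaced by zero.
                for (int ty = 0; ty < block; ++ty) {
                    for (int tx = 0; tx < block; ++tx) {
                        const int row = by * block + ty, col = bx * block + tx;
                        const int ka = t * block + tx, kb = t * block + ty;
                        As[ty * block + tx] =
                            (row < hA && ka < wA) ? A[row * wA + ka] : 0.f;
                        Bs[ty * block + tx] =
                            (kb < wA && col < wB) ? B[kb * wB + col] : 0.f;
                    }
                }
                // Compute phase: runs after __syncthreads() on a real GPU.
                for (int ty = 0; ty < block; ++ty)
                    for (int tx = 0; tx < block; ++tx)
                        for (int k = 0; k < block; ++k)
                            acc[ty * block + tx] +=
                                As[ty * block + k] * Bs[k * block + tx];
            }
            // Store phase: only threads that map to a real element write.
            for (int ty = 0; ty < block; ++ty) {
                for (int tx = 0; tx < block; ++tx) {
                    const int row = by * block + ty, col = bx * block + tx;
                    if (row < hA && col < wB)
                        C[row * wB + col] = acc[ty * block + tx];
                }
            }
        }
    }
    return C;
}

// Largest absolute difference between the sketch and a plain triple loop
// on deterministic n x n inputs; used only to check the sketch.
float maxDiffVersusNaive(int n, int block) {
    std::vector<float> A(static_cast<std::size_t>(n) * n), B(A.size());
    for (std::size_t i = 0; i < A.size(); ++i) {
        A[i] = 0.25f * static_cast<float>(i % 7);
        B[i] = 0.5f * static_cast<float>(i % 5);
    }
    const std::vector<float> C = tiledMatMulGuarded(A, B, n, n, n, block);
    float worst = 0.f;
    for (int r = 0; r < n; ++r) {
        for (int c = 0; c < n; ++c) {
            float dot = 0.f;
            for (int k = 0; k < n; ++k) dot += A[r * n + k] * B[k * n + c];
            const float got = C[r * n + c];
            const float d = got > dot ? got - dot : dot - got;
            if (d > worst) worst = d;
        }
    }
    return worst;
}
```

    In an actual kernel the same three guards would sit around the two shared memory loads and the final store to C, with the grid computed by ceiling division; the block size itself still has to respect the device's threads-per-block limit.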


© 2021 The Archive Base. All Rights Reserved