I want to do a Sparse Matrix, Dense Vector multiplication. Lets assume the only

Asked: May 30, 2026

I want to do a sparse matrix, dense vector multiplication. Let's assume the only storage format for compressing the entries in the matrix is compressed row storage (CRS).

My kernel looks like the following:

__global__ void
krnlSpMVmul1(
        float *data_mat,
        int num_nonzeroes,
        unsigned int *row_ptr,
        float *data_vec,
        float *data_result)
{
    extern __shared__ float local_result[];
    local_result[threadIdx.x] = 0;

    float vector_elem = data_vec[blockIdx.x];

    unsigned int start_index = row_ptr[blockIdx.x];
    unsigned int end_index = row_ptr[blockIdx.x + 1];

    for (int index = (start_index + threadIdx.x); (index < end_index) && (index < num_nonzeroes); index += blockDim.x)
        local_result[threadIdx.x] += (data_mat[index] * vector_elem);

    __syncthreads();

   // Reduction

   // Writing accumulated sum into result vector
}

As you can see, the kernel is supposed to be as naive as possible, and it even does a few things wrong (e.g. vector_elem is not always the correct value). I am aware of those things.

Now to my problem:
Suppose I am using a block size of 32 or 64 threads. As soon as a row in my matrix has more than 16 nonzeroes (e.g. 17), only the first 16 multiplications are done and saved to shared memory. I know that the value at local_result[16], which is the result of the 17th multiplication, is just zero. Using a block size of 16 or 128 threads fixes the problem.

Since I am fairly new to CUDA, I might have overlooked the simplest thing, but I cannot think of any more situations to look at.

Help is very much appreciated!

Edit in response to talonmies' comment:

I printed the value in local_result[16] directly after the computation. It was 0. Nevertheless, here is the missing code:

The reduction part:

int k = blockDim.x / 2;
while (k != 0)
{
    if (threadIdx.x < k)
        local_result[threadIdx.x] += local_result[threadIdx.x + k];
    else
        return;

    __syncthreads();

    k /= 2;
}

and how I write the results back to global memory:

data_result[blockIdx.x] = local_result[0];

That's all I've got.

Right now I am testing a scenario with a matrix consisting of a single row with 17 elements, all of which are nonzero. The buffers look like this in pseudocode:

float data_mat[17] = { val0, .., val16 }
unsigned int row_ptr[2] = { 0, 17 }
float data_vec[17] = { val0 } // all values are the same
float data_result[1] = { 0 }

And here is an excerpt of my wrapper function:

float *dev_data_mat;
unsigned int *dev_row_ptr;
float *dev_data_vec;
float *dev_data_result;

// Allocate memory on the device
HANDLE_ERROR(cudaMalloc((void**) &dev_data_mat, num_nonzeroes * sizeof(float)));
HANDLE_ERROR(cudaMalloc((void**) &dev_row_ptr, num_row_ptr * sizeof(unsigned int)));
HANDLE_ERROR(cudaMalloc((void**) &dev_data_vec, dim_x * sizeof(float)));
HANDLE_ERROR(cudaMalloc((void**) &dev_data_result, dim_y * sizeof(float)));

// Copy each buffer into the allocated memory
HANDLE_ERROR(cudaMemcpy(
        dev_data_mat,
        data_mat,
        num_nonzeroes * sizeof(float),
        cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(
        dev_row_ptr,
        row_ptr,
        num_row_ptr * sizeof(unsigned int),
        cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(
        dev_data_vec,
        data_vec,
        dim_x * sizeof(float),
        cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(
        dev_data_result,
        data_result,
        dim_y * sizeof(float),
        cudaMemcpyHostToDevice));

// Calc grid dimension and block dimension
dim3 grid_dim(dim_y);
dim3 block_dim(BLOCK_SIZE);

// Start kernel
krnlSpMVmul1<<<grid_dim, block_dim, BLOCK_SIZE>>>(
        dev_data_mat,
        num_nonzeroes,
        dev_row_ptr,
        dev_data_vec,
        dev_data_result);

I hope this is straightforward, but I will explain things if it is of any interest.

One more thing: I just realized that using a BLOCK_SIZE of 128 and having 33 nonzeroes makes the kernel fail as well. Again, only the last value is not computed.


1 Answer

Editorial Team · Answer 1 · 2026-05-30T19:46:27+00:00

Your dynamically allocated shared memory size is incorrect. Right now you are doing this:

krnlSpMVmul1<<<grid_dim, block_dim, BLOCK_SIZE>>>(.....)

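The third launch parameter is the dynamic shared memory size, and it must be given in bytes, not in elements. In your 64 threads per block case you are requesting 64 bytes, which is only enough for 16 floats (64 / sizeof(float) = 16). That explains why the magic 17 entries per row case fails: thread 16 writes to local_result[16], which lies past the end of the allocation. You have a shared buffer overflow, which can trigger a protection fault in the GPU and abort the kernel. The same arithmetic covers your later observation: a BLOCK_SIZE of 128 requests 128 bytes, room for 32 floats, so the 33rd nonzero is the first one to go wrong.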
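
To make the byte arithmetic concrete, here is a minimal sketch of the two conversions involved. The helper names sharedFloatCapacity and sharedBytesForFloats are hypothetical and exist only for illustration; they are not part of the CUDA API.

```cuda
#include <cstddef>

// Hypothetical helper (not a CUDA API call): how many floats fit into a
// dynamic shared memory allocation of `bytes` bytes, i.e. the value
// passed as the third <<<...>>> launch argument.
constexpr std::size_t sharedFloatCapacity(std::size_t bytes)
{
    return bytes / sizeof(float);
}

// Hypothetical helper: the byte count to request so that each of
// `threads` threads owns exactly one float slot in shared memory.
constexpr std::size_t sharedBytesForFloats(std::size_t threads)
{
    return threads * sizeof(float);
}
```

With a 4-byte float, sharedFloatCapacity(64) is 16 and sharedFloatCapacity(128) is 32, which lines up with the failures at 17 and 33 nonzeroes. A 64-thread block needs sharedBytesForFloats(64), i.e. 256 bytes.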

You should be doing something like this:

krnlSpMVmul1<<<grid_dim, block_dim, BLOCK_SIZE * sizeof(float)>>>(.....)

That gives every thread in the block its own float slot in local_result and should eliminate the problem.
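
For completeness, here is a sketch of how the corrected launch might look end to end. It keeps the kernel name, parameters and the one-vector-value-per-row simplification from the question. The host wrapper runSpMV and its std::vector interface are assumptions made for this example, as is the requirement that the block size is a power of two. The launch is followed by cudaGetLastError and cudaDeviceSynchronize so that a fault inside the kernel is reported instead of silently leaving stale results.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One block per row; each thread accumulates a strided slice of the row.
// Keeps the question's simplification of one vector value per row.
__global__ void krnlSpMVmul1(const float *data_mat, int num_nonzeroes,
                             const unsigned int *row_ptr,
                             const float *data_vec, float *data_result)
{
    extern __shared__ float local_result[];  // blockDim.x floats, sized at launch
    float sum = 0.0f;
    const float vector_elem = data_vec[blockIdx.x];
    const unsigned int start_index = row_ptr[blockIdx.x];
    const unsigned int end_index = row_ptr[blockIdx.x + 1];

    for (unsigned int index = start_index + threadIdx.x;
         index < end_index && index < (unsigned int)num_nonzeroes;
         index += blockDim.x)
        sum += data_mat[index] * vector_elem;

    local_result[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction; assumes blockDim.x is a power of two.
    for (unsigned int k = blockDim.x / 2; k != 0; k /= 2) {
        if (threadIdx.x < k)
            local_result[threadIdx.x] += local_result[threadIdx.x + k];
        __syncthreads();  // written so every thread reaches each barrier
    }
    if (threadIdx.x == 0)
        data_result[blockIdx.x] = local_result[0];
}

// Hypothetical host wrapper (not from the question): returns true on
// success and fills `result` with one value per row.
bool runSpMV(const std::vector<float> &mat,
             const std::vector<unsigned int> &row_ptr,
             const std::vector<float> &vec,
             std::vector<float> &result, int block_size)
{
    const int rows = (int)row_ptr.size() - 1;
    result.assign(rows, 0.0f);
    float *d_mat = nullptr, *d_vec = nullptr, *d_res = nullptr;
    unsigned int *d_row = nullptr;

    // Remember only the first error so later calls do not mask it.
    cudaError_t err = cudaSuccess;
    auto check = [&err](cudaError_t e) { if (err == cudaSuccess) err = e; };

    check(cudaMalloc((void **)&d_mat, mat.size() * sizeof(float)));
    check(cudaMalloc((void **)&d_row, row_ptr.size() * sizeof(unsigned int)));
    check(cudaMalloc((void **)&d_vec, vec.size() * sizeof(float)));
    check(cudaMalloc((void **)&d_res, rows * sizeof(float)));
    if (err == cudaSuccess) {
        check(cudaMemcpy(d_mat, mat.data(), mat.size() * sizeof(float),
                         cudaMemcpyHostToDevice));
        check(cudaMemcpy(d_row, row_ptr.data(),
                         row_ptr.size() * sizeof(unsigned int),
                         cudaMemcpyHostToDevice));
        check(cudaMemcpy(d_vec, vec.data(), vec.size() * sizeof(float),
                         cudaMemcpyHostToDevice));
    }
    if (err == cudaSuccess) {
        // Third launch argument is in BYTES: one float per thread.
        const size_t shared_bytes = block_size * sizeof(float);
        krnlSpMVmul1<<<rows, block_size, shared_bytes>>>(
            d_mat, (int)mat.size(), d_row, d_vec, d_res);
        check(cudaGetLastError());       // errors detected at launch
        check(cudaDeviceSynchronize());  // errors raised while the kernel ran
    }
    if (err == cudaSuccess)
        check(cudaMemcpy(result.data(), d_res, rows * sizeof(float),
                         cudaMemcpyDeviceToHost));
    if (err != cudaSuccess)
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(d_mat); cudaFree(d_row); cudaFree(d_vec); cudaFree(d_res);
    return err == cudaSuccess;
}
```

Calling runSpMV with the single-row test case from the question (17 nonzeroes, row_ptr = { 0, 17 }, a block size of 64) exercises exactly the index that used to fall outside the shared buffer. The reduction here lets every thread reach each __syncthreads() rather than returning early; that is a conservative choice for this sketch, not a claim that the original early return was the cause of the problem.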
