Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 7587823
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 30, 20262026-05-30T19:46:25+00:00 2026-05-30T19:46:25+00:00

I want to do a Sparse Matrix, Dense Vector multiplication. Lets assume the only

  • 0

I want to do a Sparse Matrix, Dense Vector multiplication. Lets assume the only storage format for compressing the entries in the Matrix is compressed row storage CRS.

My kernel looks like the following:

__global__ void
krnlSpMVmul1(
        float *data_mat,
        int num_nonzeroes,
        unsigned int *row_ptr,
        float *data_vec,
        float *data_result)
{
    extern __shared__ float local_result[];
    local_result[threadIdx.x] = 0;

    float vector_elem = data_vec[blockIdx.x];

    unsigned int start_index = row_ptr[blockIdx.x];
    unsigned int end_index = row_ptr[blockIdx.x + 1];

    for (int index = (start_index + threadIdx.x); (index < end_index) && (index < num_nonzeroes); index += blockDim.x)
        local_result[threadIdx.x] += (data_mat[index] * vector_elem);

    __syncthreads();

   // Reduction

   // Writing accumulated sum into result vector
}

As you can see the kernel is supposed to be as naive as possible and it even does a few things wrong (e.g. vector_elem is just not always the correct value). I am aware of those things.

Now to my problem:
Suppose I am using a blocksize of 32 or 64 threads. As soon as a row in my matrix has more than 16 nonzeroes (e.g. 17) only the first 16 multiplications are done and save to shared memory. I know that the value at local_result[16] which is the result of the 17th multiplication is just zero. Using a blocksize of 16 or 128 threads fixes the explained problem.

Since I am fairly new to CUDA I might have overlooked the simplest thing but I cannot make up any more situations to look at.

Help is very much appreciated!


Edit towards talonmies comment:

I printed the values which were in local_result[16] directly after the computation. It was 0. Nevertheless, here is the missing code:

The reduction part:

int k = blockDim.x / 2;
while (k != 0)
{
    if (threadIdx.x < k)
        local_result[threadIdx.x] += local_result[threadIdx.x + k];
    else
        return;

    __syncthreads();

    k /= 2;
}

and how I write the results back to global memory:

data_result[blockIdx.x] = local_result[0];

Thats all I got.

Right now I am testing a scenario with a matrix consisting of a single row with 17 element which all are non-zeroes. The buffers look like this in pseudocode:

float data_mat[17] = { val0, .., val16 }
unsigned int row_ptr[2] = { 0, 17 }
float data_vec[17] = { val0 } // all values are the same
float data_result[1] = { 0 }

And thats an excerpt of my wrapper function:

float *dev_data_mat;
unsigned int *dev_row_ptr;
float *dev_data_vec;
float *dev_data_result;

// Allocate memory on the device
HANDLE_ERROR(cudaMalloc((void**) &dev_data_mat, num_nonzeroes * sizeof(float)));
HANDLE_ERROR(cudaMalloc((void**) &dev_row_ptr, num_row_ptr * sizeof(unsigned int)));
HANDLE_ERROR(cudaMalloc((void**) &dev_data_vec, dim_x * sizeof(float)));
HANDLE_ERROR(cudaMalloc((void**) &dev_data_result, dim_y * sizeof(float)));

// Copy each buffer into the allocated memory
HANDLE_ERROR(cudaMemcpy(
        dev_data_mat,
        data_mat,
        num_nonzeroes * sizeof(float),
        cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(
        dev_row_ptr,
        row_ptr,
        num_row_ptr * sizeof(unsigned int),
        cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(
        dev_data_vec,
        data_vec,
        dim_x * sizeof(float),
        cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(
        dev_data_result,
        data_result,
        dim_y * sizeof(float),
        cudaMemcpyHostToDevice));

// Calc grid dimension and block dimension
dim3 grid_dim(dim_y);
dim3 block_dim(BLOCK_SIZE);

// Start kernel
krnlSpMVmul1<<<grid_dim, block_dim, BLOCK_SIZE>>>(
        dev_data_mat,
        num_nonzeroes,
        dev_row_ptr,
        dev_data_vec,
        dev_data_result);

I hope this is straightforward but will explain things if it is of any interest.

One more thing: I just realized that using a BLOCK_SIZE of 128 and having 33 nonzeroes makes the kernel fail as well. Again just the last value is not being computed.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-30T19:46:27+00:00Added an answer on May 30, 2026 at 7:46 pm

    Your dynamically allocated shared memory size is incorrect. Right now you are doing this:

    krnlSpMVmul1<<<grid_dim, block_dim, BLOCK_SIZE>>>(.....)
    

    The shared memory size should be given in bytes. Using your 64 threads per block case, that means you would be allocating enough shared memory for 16 float sized words and explains why the magic 17 entries per row case results in failure – you have a shared buffer overflow which will trigger a protection fault in the GPU and abort the kernel.

    You should be doing something like this:

    krnlSpMVmul1<<<grid_dim, block_dim, BLOCK_SIZE * sizeof(float)>>>(.....)
    

    That will give you the correct dynamic shared memory size and should eliminate the problem.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

The setting is here. X: 6000x8000 non-sparse matrix B: 8000x1 sparse vector with only
I want to understand how to compute big-O for a dense versus sparse graph.
I'm writing a sparse matrix class in C++ in which every row and column
We have an application that stores a sparse matrix. This matrix has entries that
I want to multiply a sparse matrix A, with a matrix B which has
I have a large sparse matrix representing attributes for millions of entities. For example,
I have a sparse matrix that is not symmetric I.E. the sparsity is somewhat
I am using sparse matrices as a mean of compressing data, with loss of
I am using Scipy to construct a large, sparse (250k X 250k) co-occurrence matrix
We have two classes: template<typename T, typename Size, typename Stack, typename Sparse> class Matrix

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.