The Archive Base

Editorial Team
Asked: June 8, 2026

I am trying to understand the CUDA programming model and its features. As an exercise, I am trying to convert the following loop structure, which includes a function call, into an efficient CUDA kernel:

//function call
bool gmul(int rowsize,int *Ai,int *Bj,int colsize)
{
    for(int i = 0;i < rowsize;i++)
    {
        for(int j = 0;j < colsize;j++)
        {
            if(Ai[i] == Bj[j])
            {
                return true;
            }
        }
    }
    return false;
}

//Some for loop in main function is as follows

for(i = 0;i < q ;i++)
{
    cbeg = Bjc[i];
    cend = Bjc[i+1];
    for(j = 0;j < m;j++)
    {
        beg = Aptr[j];
        end = Aptr[j+1];
        if(gmul(end - beg,Acol + beg,Bir + cbeg,cend - cbeg))
        {
            temp++;
        }
    }
    Cjc1[i+1] = temp ;
}

And my kernel, with its function call, is as follows.

__device__ bool mult(int colsize,int rowsize,int *Aj,int *Bi,int *val)
{
    for(int j = 0; j < rowsize;j++)
    {
        for(int k = 0;k < colsize;k++)
        {
            if(Aj[j] == Bi[k])
            {
                return true;
            }
        }
    }
    return false;
}


__global__ void kernel(int *Aptr,int *Aj,int *Bptr,int *Bi,int rows,int cols,int *count,int *Cjc)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int i;
    if(tid < cols)
    {
        int beg = Bptr[tid];
        int end = Bptr[tid+1];
        for(i = 0;i < rows;i++)
        {
            int cbeg = Aptr[i];
            int cend = Aptr[i+1];
            if(mult(end - beg,cend - cbeg,Aj+cbeg,Bi+beg,count))
            {
                //atomicAdd(count,1);
                //Changes made are in next line
                atomicAdd(Cjc+tid+1,1);
            }
        }
        //atomicAdd(Cjc+tid+1,*count);
    }
}

What I want is that whenever the __device__ function mult returns true, the kernel should increment the counter for that particular thread, and once the for loop (in kernel function) ends, it should store the value into Cjc array and count is handed over to other threads for increment operation. However, I am not getting the expected values. All I get in the Cjc array is the final count after all the threads have finished executing.

I am using a GTX 480 with CC 2.0.

Any suggestions or hints as to why I am getting wrong answers, or optimizations for this CUDA kernel, will be appreciated.
Thanks in advance.
********Solved***********

Right now, I am facing another issue: whenever the size reaches 4000 or more, every element of the array comes back as 0. Here is how I launch the kernel.

int numBlocks,numThreads;

if(q % 32 == 0)
{
    numBlocks = q/32;
    numThreads = 32;
}
else
{
    numBlocks = (q+31)/32;
    numThreads = 32;
}
findkernel<<<numBlocks,numThreads>>>(devAptr,devAcol,devBjc,devBir,m,q,d_Cjc);

I was wondering whether I am crossing any limits on block or grid dimensions, but for CC 2.0 I think I am launching enough blocks and threads without crossing any limit. I wonder why all the answers are still coming out as 0.
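The launch arithmetic above can be checked with a small sketch. With q = 4000 and 32 threads per block the grid has 125 blocks, far below the 65535-block limit per grid dimension on compute capability 2.0, so the launch configuration alone is unlikely to be the problem. An all-zero result can be a sign that the kernel did not run to completion, so reading back the CUDA error status after the launch is a reasonable first step. The helper names blocksFor and lastKernelOk are hypothetical and do not come from the original post; this is only a sketch and does not diagnose the actual cause.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Ceiling division. It gives the same result as both branches of the
// if/else in the question, so the branch is not needed.
inline int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical helper: reports any error from the most recent kernel launch.
// Returns true only when both checks come back as cudaSuccess.
inline bool lastKernelOk(const char *label)
{
    cudaError_t err = cudaGetLastError();      // launch-time errors (e.g. bad configuration)
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();         // errors raised while the kernel was running
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", label, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Possible use with the launch from the question:
//   findkernel<<<blocksFor(q, 32), 32>>>(devAptr, devAcol, devBjc, devBir, m, q, d_Cjc);
//   if (!lastKernelOk("findkernel")) { /* contents of d_Cjc are not valid */ }
```

If lastKernelOk reports an error, the message printed by cudaGetErrorString is usually the quickest pointer to what went wrong.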

1 Answer

  1. Editorial Team
    Added an answer on June 8, 2026 at 2:44 pm

    You have written parallel threads that read and write count without synchronization. The threads run concurrently in an unpredictable order, so threads atomically modify count and read count in an unpredictable order. The expression *count will produce different results depending on the exact execution order.

    "once the for loop (in kernel function) ends, it should store the value into Cjc array and count is handed over to other threads for increment operation."

    There is no synchronization, so no thread waits for another thread to finish the loop. Instead of making all threads share the same storage for count, why not give each thread a different piece of storage? Then the threads will not influence one another’s result. You can run a scan kernel after this one to combine results.
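The suggestion above could be sketched as follows. Each thread counts in a private variable and writes only its own slot Cjc[tid + 1], so no atomics or shared counter are needed. A scan then turns the per-column counts into the running totals that the original host loop produced. Thrust's inclusive_scan stands in here for the "scan kernel"; that choice, and the names sharesIndex, countPerColumn and buildCjc, are assumptions made for illustration and not part of the original answer.

```cuda
#include <cuda_runtime.h>
#include <thrust/execution_policy.h>
#include <thrust/scan.h>

// Same test as mult() in the question: true if the two index lists share a value.
__device__ bool sharesIndex(const int *Aj, int rowsize, const int *Bi, int colsize)
{
    for (int j = 0; j < rowsize; j++)
        for (int k = 0; k < colsize; k++)
            if (Aj[j] == Bi[k])
                return true;
    return false;
}

// One thread per column of B. The count lives in a private variable and
// each thread writes only its own output slot.
__global__ void countPerColumn(const int *Aptr, const int *Aj,
                               const int *Bptr, const int *Bi,
                               int rows, int cols, int *Cjc)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid >= cols) return;

    int beg = Bptr[tid];
    int end = Bptr[tid + 1];
    int local = 0;                       // per-thread storage, never shared
    for (int i = 0; i < rows; i++) {
        int cbeg = Aptr[i];
        int cend = Aptr[i + 1];
        if (sharesIndex(Aj + cbeg, cend - cbeg, Bi + beg, end - beg))
            local++;
    }
    Cjc[tid + 1] = local;
}

// Hypothetical host wrapper. All pointers are device pointers and
// dCjc must have room for cols + 1 ints.
void buildCjc(const int *dAptr, const int *dAj, const int *dBptr, const int *dBi,
              int rows, int cols, int *dCjc)
{
    cudaMemset(dCjc, 0, sizeof(int));    // Cjc[0] = 0
    if (cols <= 0) return;               // nothing to launch

    const int threads = 128;             // assumed block size
    const int blocks = (cols + threads - 1) / threads;
    countPerColumn<<<blocks, threads>>>(dAptr, dAj, dBptr, dBi, rows, cols, dCjc);

    // In-place inclusive scan over Cjc[1..cols]: per-column counts become
    // running totals, matching Cjc1[i+1] = temp in the host loop.
    thrust::inclusive_scan(thrust::device, dCjc + 1, dCjc + 1 + cols, dCjc + 1);
}
```

Because the kernel and the scan are both issued on the default stream, the scan starts only after every thread has written its count, which provides the ordering that the shared counter could not.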

© 2021 The Archive Base. All Rights Reserved