Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 7601259
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 30, 20262026-05-30T23:07:29+00:00 2026-05-30T23:07:29+00:00

My problem is the following: I have an image in which I detect some

  • 0

My problem is the following: I have an image in which I detect some points of interest using the GPU. The detection is a heavyweight test in terms of processing, however only about 1 in 25 points pass the test on average. The final stage of the algorithm is to build up a list of the points. On the CPU this would be implemented as:

forall pixels x,y
{
    if(test_this_pixel(x,y))
        vector_of_coordinates.push_back(Vec2(x,y));
}

On the GPU I have each CUDA block processing 16×16 pixels. The problem is that I need to do something special to eventually have a single consolidated list of points in global memory. At the moment I am trying to generate a local list of points in shared memory per block which eventually will be written to global memory. I am trying to avoid sending anything back to the CPU because there are more CUDA stages after this.

I was expecting that I could use atomic operations to implement the push_back function on shared memory. However I am unable to get this working. There are two issues. The first annoying issue is that I am constantly running into the following compiler crash:

nvcc error : 'ptxas' died with status 0xC0000005 (ACCESS_VIOLATION)

when using atomic operations. It is hit or miss whether I can compile something. Does anyone know what causes this?

The following kernel will reproduce the error:

__global__ void gpu_kernel(int w, int h, RtmPoint *pPoints, int *pCounts)
{
    __shared__ unsigned int test;
    atomicInc(&test, 1000);
}

Secondly, my code which includes a mutex lock on shared memory hangs the GPU and I don’t understand why:

__device__ void lock(unsigned int *pmutex)
{
    while(atomicCAS(pmutex, 0, 1) != 0);
}

__device__ void unlock(unsigned int *pmutex)
{
    atomicExch(pmutex, 0);
}

__global__ void gpu_kernel_non_max_suppress(int w, int h, RtmPoint *pPoints, int *pCounts)
{
    __shared__ RtmPoint localPoints[64];
    __shared__ int localCount;
    __shared__ unsigned int mutex;

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    int threadid = threadIdx.y * blockDim.x + threadIdx.x;
    int blockid = blockIdx.y * gridDim.x + blockIdx.x;

    if(threadid==0)
    {
        localCount = 0;
        mutex = 0;
    }

    __syncthreads();

    if(x<w && y<h)
    {
        if(some_test_on_pixel(x,y))
        {
            RtmPoint point;
            point.x = x;
            point.y = y;

            // this is a local push_back operation
            lock(&mutex);
            if(localCount<64) // we should never get >64 points per block
                localPoints[localCount++] = point;
            unlock(&mutex);
        }
    }

    __syncthreads();

    if(threadid==0)
        pCounts[blockid] = localCount;
    if(threadid<localCount)
        pPoints[blockid * 64 + threadid] = localPoints[threadid];
}

In the example code at this site, the author manages to successfully use atomic operations on shared memory, so I am confused as to why my case does not function. If I comment out the lock and unlock lines, the code runs ok, but obviously incorrectly adding to the list.

I would appreciate some advice about why this problem is happening and also perhaps if there is a better solution to achieving the goal, since I am concerned anyway about the performance issues with using atomic operations or mutex locks.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-30T23:07:30+00:00Added an answer on May 30, 2026 at 11:07 pm

    I suggest using prefix-sum to implement that part to increase parallelism. To do that you need to use a shared array. Basically prefix-sum will turn an array (1,1,0,1) into (0,1,2,2,3), i.e., will calculate an in-place running exclusive sum so that you’ll get per-thread write indices.

    __shared__ uint8_t vector[NUMTHREADS];
    
    ....
    
    bool emit  = (x<w && y<h);
         emit  = emit && some_test_on_pixel(x,y);
    __syncthreads();
    scan(emit, vector);
    if (emit) {
         pPoints[blockid * 64 + vector[TID]] = point;
    }
    

    prefix-sum example:

        template <typename T>
    __device__ uint32 scan(T mark, T *output) {
    #define GET_OUT (pout?output:values)
    #define GET_INP (pin?output:values)
      __shared__ T values[numWorkers];
      int pout=0, pin=1;
      int tid = threadIdx.x;
    
      values[tid] = mark;
    
      syncthreads();
    
      for( int offset=1; offset < numWorkers; offset *= 2) {
        pout = 1 - pout; pin = 1 - pout;
        syncthreads();
        if ( tid >= offset) {
          GET_OUT[tid] = (GET_INP[tid-offset]) +( GET_INP[tid]);
        }
        else {
          GET_OUT[tid] = GET_INP[tid];
        }
        syncthreads();
      }
    
      if(!pout)
        output[tid] =values[tid];
    
      __syncthreads();
    
      return output[numWorkers-1];
    
    #undef GET_OUT
    #undef GET_INP
    }
    
    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have the following simple button with an image as the template. The problem
I have the following problem using subversion: I'm currently working on the trunk of
I have the following problem using template instantiation [*]. file foo.h class Foo {
I have the following problem in my Data Structures and Problem Solving using Java
I have the following problem. If I query values with a keyfigure which is
I have a problem in the following code: //quesry the db for image information
I have the following problem which I'm trying to solve with javascript. I have
I have the following Execution statement which creates a table (using data from another
I have following problem: In my titanium project I save an image ( from
I have the following problem: i have some data loaded in my application, that

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.