The Archive Base

Editorial Team
Asked: June 13, 2026 (2026-06-13T17:44:07+00:00)

I am teaching myself CUDA using the official CUDA programming guide. For practice, I wrote a simple kernel that finds the maximum of an array and returns it to the CPU:

  __global__ void getTheMaximum(float* myArrayFromCPU, float* returnedMaximum) {
    // Store my current value in shared memory.
    extern __shared__ float sharedData[];
    sharedData[threadIdx.x] = myArrayFromCPU[threadIdx.x];

    // Iteratively calculate the maximum.
    int halfScan = blockDim.x / 2;
    while (halfScan > 0 && threadIdx.x < halfScan) {
      if (sharedData[threadIdx.x] < sharedData[threadIdx.x + halfScan]) {
        sharedData[threadIdx.x] = sharedData[threadIdx.x + halfScan];
      }
      halfScan = halfScan / 2;
    }

    // Put maximum value in global memory for later return to CPU.
    returnedMaximum[0] = sharedData[0];
  }

myArrayFromCPU is an array of 1024 float values. returnedMaximum is a trivial array containing a single item: the computed maximum value.

My idea is that the algorithm narrows down the maximum iteratively. On each pass, every thread in the lower half of the active range compares its value with the one halfScan positions beyond it and keeps the larger; then the range is halved.

However, when I run this code, I get unreliable output. The returned maximum value varies. Why is that? How can a single algorithm produce different values every time?

Update:

I’m also just running on a single block. I guarantee this by setting a 1-dimensional block size of X=1024.

1 Answer

  1. Editorial Team
    Added an answer on June 13, 2026 at 5:44 pm (2026-06-13T17:44:08+00:00)

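    It is not guaranteed that all threads of a block execute at the very same moment. Threads are scheduled in warps (groups of 32 threads), and the warps of one block can run at different times, so one warp may read a shared-memory slot before another warp has written it. On older GPUs the threads within a single warp did run in lockstep, but since Volta (independent thread scheduling) even that should not be relied upon without explicit synchronization.

    To avoid such hazards within a block, use the __syncthreads() intrinsic, which makes each thread that reaches it wait until every thread of the block has reached the same point.
    Note that you should not put __syncthreads() inside branching code unless you can guarantee that all threads of the block take the same branch and therefore reach it.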

    Try the following loop:

    __syncthreads();
    while (halfScan > 0) {
      if (threadIdx.x < halfScan) {
        if (sharedData[threadIdx.x] < sharedData[threadIdx.x + halfScan]) {
          sharedData[threadIdx.x] = sharedData[threadIdx.x + halfScan];
        }
      }
      __syncthreads();
      halfScan = halfScan / 2;
    }
    

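    Note that I moved the condition threadIdx.x < halfScan out of the while loop and into an if statement, because all threads must execute __syncthreads() at the same place and the same number of times. In the original loop, threads with a high threadIdx.x left the loop early, and each thread halved halfScan at its own pace without waiting for the others.

    The __syncthreads() before the loop ensures that the loads from myArrayFromCPU into shared memory are complete for all threads before any thread reads another thread's slot.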
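
    Putting the pieces together, a complete version might look like the sketch below. It assumes a single block whose size is a power of two and equals the array length (1024 in the question). The host helper maxOnGpu is a hypothetical name that is not part of the original code, and CUDA error checking is omitted for brevity.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Block-wide maximum via a shared-memory tree reduction.
// Assumption: one block, blockDim.x is a power of two and equals the array length.
__global__ void getTheMaximum(const float* myArrayFromCPU, float* returnedMaximum) {
  extern __shared__ float sharedData[];
  sharedData[threadIdx.x] = myArrayFromCPU[threadIdx.x];
  __syncthreads();  // all loads must finish before any thread reads another slot

  for (int halfScan = blockDim.x / 2; halfScan > 0; halfScan /= 2) {
    if (threadIdx.x < halfScan &&
        sharedData[threadIdx.x] < sharedData[threadIdx.x + halfScan]) {
      sharedData[threadIdx.x] = sharedData[threadIdx.x + halfScan];
    }
    __syncthreads();  // outside the branch, so every thread reaches it on every pass
  }

  // One writer is enough; sharedData[0] now holds the block maximum.
  if (threadIdx.x == 0) {
    returnedMaximum[0] = sharedData[0];
  }
}

// Hypothetical host helper (not in the original post): copies n floats to the
// device, launches one block of n threads, and returns the computed maximum.
float maxOnGpu(const float* hostValues, int n) {
  // The kernel's assumptions: power-of-two size that fits in one block.
  assert(n > 0 && n <= 1024 && (n & (n - 1)) == 0);

  float* deviceValues = nullptr;
  float* deviceMax = nullptr;
  float result = 0.0f;
  cudaMalloc(&deviceValues, n * sizeof(float));
  cudaMalloc(&deviceMax, sizeof(float));
  cudaMemcpy(deviceValues, hostValues, n * sizeof(float), cudaMemcpyHostToDevice);

  // Third launch parameter: bytes of dynamic shared memory backing sharedData.
  getTheMaximum<<<1, n, n * sizeof(float)>>>(deviceValues, deviceMax);

  // This blocking copy also waits for the kernel to finish.
  cudaMemcpy(&result, deviceMax, sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(deviceValues);
  cudaFree(deviceMax);
  return result;
}
```

    Calling maxOnGpu(values, 1024) on the host should then return the same value on every run. Having only thread 0 write the result is a small design choice that avoids 1024 redundant stores of the same value to global memory.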
