Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8856613
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 14, 20262026-06-14T14:23:43+00:00 2026-06-14T14:23:43+00:00

Im trying write program in CUDA but I have problem with synchronization in the

  • 0

Im trying write program in CUDA but I have problem with synchronization in the same block between threads.

Here is model situation:

 10 __global__ void gpu_test_sync()
 11 {
 12     __shared__ int t;
 13     int tid = threadIdx.x;
 14
 15     t = 0;
 16     __threadfence();
 17     __syncthreads();
 18
 19     // for(int i=0; i<1000000 && t<tid; i++); // with fuse
 20     while(t<tid);
 21
 22     t++;
 23     __threadfence();
 24 }
 25
 26 void f_cpu()
 27 {
 28     printf("TEST ... ");
 29     int blocks = 1;
 30     int threads = 2;
 31     gpu_test_sync<<< blocks , threads >>>();
 32     printf("OK\n");
 33 }

If threads = 1, everything is ok. If threads > 1, infinite cycling.

Why? Function __threadfence(); should make visible value of t variable for other threads.

How I can solve it?

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-14T14:23:44+00:00Added an answer on June 14, 2026 at 2:23 pm

    I don’t believe your kernel will be able to do what you are trying to do because of the divergent branch in while(t<tid) causing all threads of the warp to loop indefinitely and never arriving at the line ++t.

    Long explanation

    scroll to ‘The important part’ for the important stuff if you already know about threads and blocks and warps:

    (I have no experience with the Kepler architecture, yet. Some of these numbers may be different if not using Fermi.)

    Some terms need to be explained to understand the next section:
    The following terms relate to the logical (logical as in software constructs) threads:

    • thread – a single thread of execution.
    • block – a group of multiple threads that execute the same kernel.
    • grid – a group of blocks.

    The following terms relate to the physical (physical as in hardware architecture dependent) threads:

    • core – a single compute core, one core runs exactly one instruction at a time.
    • warp – a group of threads that execute in parallel on the hardware, a warp consists of 32 threads on current generation CUDA hardware.

    Kernels are executed by one or more Streaming Multiprocessors (SM). A typical
    mid-to-high-end GeForce card from the Fermi family (GeForce 400 and GeForce
    500 series) has 8-16 SMs on a single GPU[Fermi whitepaper]. Each SM consists of 32 CUDA Cores (cores). Threads are scheduled for execution by the warp schedulers, each SM has
    two warp scheduler units that work in a lockstep fashion. The smallest unit that
    a warp scheduler can schedule is called a warp, which consists of 32 threads on all
    CUDA hardware released so far at the time of writing. Only one warp may execute
    at a time on each SM.

    Threads in CUDA are much more lightweight than CPU threads, context switches
    are cheaper and all threads of a warp execute the same instruction or have to
    wait while the other threads in the warp execute the instruction, this is called Sin-
    gle Instruction Multiple Thread (SIMT) and is similar to traditional CPU Single
    Instruction Multiple Data (SIMD) instructions such as SSE, AVX, NEON, Al-
    tivec etc., this has consequences when using conditional statements as described
    further down.

    To allow for problems which demand more than 32 threads to solve the CUDA
    threads are arranged into logical groups called blocks and grids of sizes that are
    defined by the software developer. A block is a 3-dimensional collection of threads,
    each thread in the block has its own individual 3-dimensional identification num-
    ber to allow the developer to distinguish between the threads in the kernel code.
    Threads within a single block can share data through shared memory, this reduces
    the load on global memory. Shared memory has a much lower latency than global
    memory but is a limited resource, the user can choose between (per block) 16 kB
    shared memory and 48 kB L1 cache or 48 kB shared memory and 16 kB L1 cache.

    Several blocks of threads in turn can be grouped into a grid. Grids are 3-dimensional
    arrays of blocks. The maximum block size is tied to the available hardware resources while the grids can be of (almost) arbitrary size. Blocks within a grid can
    only share data through global memory, which is the on-GPU memory which has
    the highest latency.

    A Fermi GPU can have 48 warps (1536 threads) active at once per SM, given
    that the threads use little enough local and shared memory to fit all at the same
    time. Context switches between threads are fast since registers are allocated to the
    threads and hence there is no need for saving and restoring registers and shared
    memory between thread switches. The result is that it is actually desired to over-
    allocate the hardware since it will hide memory stalls inside the kernels by letting
    the warp schedulers switch the currently active warp whenever a stall occurs.

    The important part

    The thread warp is a hardware group of threads that execute on the same Streaming Multiprocessor (SM).
    Threads of a warp can be compared to sharing a common program counter between
    the threads, hence all threads must execute the same line of program code. If the
    code has some brancing statements such as if ... then ... else the warp must
    first execute the threads that enter the first block, while the other threads of the
    warp wait, next the threads that enter the next block will execute while the other
    threads wait and so on. Because of this behaviour conditional statements should
    be avoided in GPU code if possible. When threads of a warp follow different lines
    of execution it is known as having divergent threads. While conditional blocks
    should be kept to a minimum inside CUDA kernels, it is sometimes possible to
    reorder statements so that all threads of the same warp follow only a single path
    of execution in an if ... then ... else block and mitigate this limitation.

    The while and for statements are branching statements, so it is not limited to if.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

now, i'm trying to write program using php and sqlite, but i have problem.
I am trying to write a program that displays the integers between 1 and
Im trying to write a program to read a text file through args but
I'm trying to write a program to automate a ticket draft. We have a
I am trying to write a program with Visual Basic 2010. The problem is
I was trying to write a program that would display the prime numbers between
I'm trying to write a program that can password protect individual folders, but I'm
I am trying to write a program to simulate a card game. I have
I am trying to write a program like bouncingBall. but i generated N obstacles
I'm trying to write a program that uses fattach but the compiler says fattach

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.