The Archive Base

Editorial Team
Asked: June 15, 2026

Is it possible for a CUDA kernel to synchronize writes to device-mapped memory without

Is it possible for a CUDA kernel to synchronize writes to device-mapped memory without any host-side invocation (e.g., of cudaDeviceSynchronize)? When I run the following program, the kernel does not seem to wait for its writes to device-mapped memory to complete before terminating: examining the page-locked host memory immediately after the kernel launch shows no modification unless I insert a delay or uncomment the call to cudaDeviceSynchronize:

#include <stdio.h>
#include <cuda.h>

__global__ void func(int *a, int N) {
    int idx = threadIdx.x;

    if (idx < N) {
        a[idx] *= -1;
        __threadfence_system();
    }
}

int main(void) {
    int *a, *a_gpu;
    const int N = 8;
    size_t size = N*sizeof(int);

    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void **) &a, size, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **) &a_gpu, (void *) a, 0);

    for (int i = 0; i < N; i++) {
        a[i] = i;
    }
    for (int i = 0; i < N; i++) {
        printf("%i ", a[i]);
    }
    printf("\n");

    func<<<1, N>>>(a_gpu, N);
    // cudaDeviceSynchronize();

    for (int i = 0; i < N; i++) {
        printf("%i ", a[i]);
    }
    printf("\n");

    cudaFreeHost(a);
}

I’m compiling the above for sm_20 with CUDA 4.2.9 on Linux and running it on a Fermi GPU (S2050).

1 Answer

  1. Editorial Team
    Added an answer on June 15, 2026 at 10:19 am

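    A kernel launch returns control to the host immediately, usually before the kernel has finished executing and often before it has even started. Kernel execution is asynchronous with respect to the host and does not block it, so it is no surprise that you have to wait a bit, or use a barrier such as cudaDeviceSynchronize(), before the host sees the kernel's results. Note that __threadfence_system() in your kernel only orders the calling thread's writes as observed by other threads and by the host; it does not make the host wait for the kernel.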
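    A minimal sketch of the barrier approach, reusing the mapped-memory setup from the question. The names negate and negate_mapped_with_sync are illustrative and not from the original post, and the kernel assumes a single block, so n must not exceed the device's maximum block size.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Negate each element in place (same operation as the kernel in the question).
__global__ void negate(int *a, int n) {
    int idx = threadIdx.x;
    if (idx < n) {
        a[idx] *= -1;
    }
}

// Hypothetical helper: returns true if, after an explicit host-side barrier,
// the host sees every element of the mapped buffer negated.
bool negate_mapped_with_sync(int n) {
    int *a = NULL, *a_gpu = NULL;

    // Enable mapped page-locked memory, then allocate and map the buffer.
    if (cudaSetDeviceFlags(cudaDeviceMapHost) != cudaSuccess) return false;
    if (cudaHostAlloc((void **)&a, n * sizeof(int), cudaHostAllocMapped) != cudaSuccess) return false;
    if (cudaHostGetDevicePointer((void **)&a_gpu, a, 0) != cudaSuccess) {
        cudaFreeHost(a);
        return false;
    }

    for (int i = 0; i < n; i++) a[i] = i;

    negate<<<1, n>>>(a_gpu, n);                 // returns to the host immediately
    cudaError_t err = cudaDeviceSynchronize();  // host blocks until the kernel is done

    // Only now is it safe for the host to read the kernel's writes.
    bool ok = (err == cudaSuccess);
    for (int i = 0; ok && i < n; i++) ok = (a[i] == -i);

    cudaFreeHost(a);
    return ok;
}
```

    Calling negate_mapped_with_sync(8) from main corresponds to the question's program with the cudaDeviceSynchronize() line uncommented.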

    As described in the CUDA C Programming Guide:

    In order to facilitate concurrent execution between host and device,
    some function calls are asynchronous: Control is returned to the host
    thread before the device has completed the requested task. These are:

    • Kernel launches;
    • Memory copies between two addresses to the same device memory;
    • Memory copies from host to device of a memory block of 64 KB or less;
    • Memory copies performed by functions that are suffixed with Async;
    • Memory set function calls.

    This is all intentional, of course, so that you can use the GPU and CPU simultaneously. If you don’t want this behavior, a simple solution, as you’ve already discovered, is to insert a barrier. If your kernel produces data that you immediately copy back to the host, you don’t need a separate barrier: a cudaMemcpy call issued after the kernel waits until the kernel has completed before it begins its copy operation.
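    The implicit wait performed by cudaMemcpy can be sketched as follows. This variant uses ordinary device memory from cudaMalloc rather than the mapped buffer in the question, and the names negate and negate_with_memcpy are illustrative. It assumes both the kernel and the copies run in the default stream and that n fits in a single block.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Negate each element in place.
__global__ void negate(int *a, int n) {
    int idx = threadIdx.x;
    if (idx < n) {
        a[idx] *= -1;
    }
}

// Hypothetical helper: returns true if the data copied back from the device
// is fully negated, with no explicit barrier between kernel and copy.
bool negate_with_memcpy(int n) {
    std::vector<int> host(n);
    for (int i = 0; i < n; i++) host[i] = i;

    int *dev = NULL;
    size_t bytes = n * sizeof(int);
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) return false;

    bool ok = cudaMemcpy(dev, &host[0], bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        negate<<<1, n>>>(dev, n);  // asynchronous launch in the default stream

        // Blocking copy in the same stream: it starts only after the kernel
        // has finished and returns only once the data is in host memory.
        ok = cudaMemcpy(&host[0], dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    for (int i = 0; ok && i < n; i++) ok = (host[i] == -i);

    cudaFree(dev);
    return ok;
}
```

    With mapped memory there is no copy to act as the synchronization point, which is why the program in the question needs an explicit barrier instead.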

    I guess, to answer your question: you want kernel launches to be synchronous without having to use a barrier at all. (Why do you want this? Is adding the cudaDeviceSynchronize() call a problem?) It is possible:

    “Programmers can globally disable asynchronous kernel launches for all
    CUDA applications running on a system by setting the
    CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is
    provided for debugging purposes only and should never be used as a way
    to make production software run reliably.”

    If you want this synchronous behavior, it’s better just to use the barriers (or depend on another subsequent CUDA call, like cudaMemcpy). If you rely on the environment variable instead, your code will break as soon as somebody runs it without the variable set. So it’s really not a good idea.
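    If blocking the whole host thread in cudaDeviceSynchronize() is the concern, one alternative barrier is an event recorded right after the kernel. The sketch below is not from the original answer: it polls the event so the host could do other work while the kernel runs. The names negate and negate_with_event_poll are hypothetical, and n is assumed to fit in a single block.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Negate each element in place.
__global__ void negate(int *a, int n) {
    int idx = threadIdx.x;
    if (idx < n) {
        a[idx] *= -1;
    }
}

// Hypothetical helper: returns the number of polls made while waiting for the
// kernel, or -1 if a CUDA call failed or the result was wrong.
long negate_with_event_poll(int n) {
    int *a = NULL, *a_gpu = NULL;
    cudaEvent_t done;

    if (cudaSetDeviceFlags(cudaDeviceMapHost) != cudaSuccess) return -1;
    if (cudaHostAlloc((void **)&a, n * sizeof(int), cudaHostAllocMapped) != cudaSuccess) return -1;
    if (cudaHostGetDevicePointer((void **)&a_gpu, a, 0) != cudaSuccess) {
        cudaFreeHost(a);
        return -1;
    }
    if (cudaEventCreate(&done) != cudaSuccess) {
        cudaFreeHost(a);
        return -1;
    }

    for (int i = 0; i < n; i++) a[i] = i;

    negate<<<1, n>>>(a_gpu, n);
    cudaEventRecord(done, 0);  // completes once the kernel before it has finished

    // Poll instead of blocking; useful CPU work could go inside this loop.
    long polls = 0;
    cudaError_t status;
    while ((status = cudaEventQuery(done)) == cudaErrorNotReady) {
        polls++;
    }
    // Returns at once when the event has already completed.
    if (status == cudaSuccess) status = cudaEventSynchronize(done);

    bool ok = (status == cudaSuccess);
    for (int i = 0; ok && i < n; i++) ok = (a[i] == -i);

    cudaEventDestroy(done);
    cudaFreeHost(a);
    return ok ? polls : -1;
}
```

    This is still a host-side call; it only changes how the host waits, not whether it has to.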
