Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8994551
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 15, 20262026-06-15T23:27:29+00:00 2026-06-15T23:27:29+00:00

I have some questions. Recently I’m making a program by using CUDA. In my

  • 0

I have some questions.

Recently I’m making a program by using CUDA.

In my program, there is one big data on Host programmed with std::map(string, vector(int)).

By using these datas some vector(int) are copied to GPUs global memory and processed on GPU

After processing, some results are generated on GPU and these results are copied to CPU.

These are all my program schedule.

  1. cudaMemcpy( … , cudaMemcpyHostToDevice)
  2. kernel function(kernel function only can be done when necessary data is copied to GPU global memory)
  3. cudaMemcpy( … , cudaMemcpyDeviceToHost)
  4. repeat 1~3steps 1000times (for another data(vector) )

But I want to reduce processing time.

So I decided to use cudaMemcpyAsync function in my program.

After searching some documents and web pages, I realize that to use cudaMemcpyAsync function host memory which has data to be copied to GPUs global memory must be allocated as pinned memory.

But my programs are using std::map, so I couldn’t make this std::map data to pinned memory.

So instead of using this, I made a buffer array typed pinned memory and this buffer can always handle all the case of copying vector.

Finally, my program worked like this.

  1. Memcpy (copy data from std::map to buffer using loop until whole data is copied to buffer)
  2. cudaMemcpyAsync( … , cudaMemcpyHostToDevice)
  3. kernel(kernel function only can be executed when whole data is copied to GPU global memory)
  4. cudaMemcpyAsync( … , cudaMemcpyDeviceToHost)
  5. repeat 1~4steps 1000times (for another data(vector) )

And my program became much faster than the previous case.

But problem(my curiosity) is at this point.

I tried to make another program in a similar way.

  1. Memcpy (copy data from std::map to buffer only for one vector)
  2. cudaMemcpyAsync( … , cudaMemcpyHostToDevice)
  3. loop 1~2 until whole data is copied to GPU global memory
  4. kernel(kernel function only can be executed when necessary data is copied to GPU global memory)
  5. cudaMemcpyAsync( … , cudaMemcpyDeviceToHost)
  6. repeat 1~5steps 1000times (for another data(vector) )

This method came out to be about 10% faster than the method discussed above.

But I don’t know why.

I think cudaMemcpyAsync only can be overlapped with kernel function.

But my case I think it is not. Rather than it looks like can be overlapped between cudaMemcpyAsync functions.

Sorry for my long question but I really want to know why.

Can Someone teach or explain to me what is the exact facility “cudaMemcpyAsync” and what functions can be overlapped with “cudaMemcpyAsync” ?

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-15T23:27:30+00:00Added an answer on June 15, 2026 at 11:27 pm

    The copying activity of cudaMemcpyAsync (as well as kernel activity) can be overlapped with any host code. Furthermore, data copy to and from the device (via cudaMemcpyAsync) can be overlapped with kernel activity. All 3 activities: host activity, data copy activity, and kernel activity, can be done asynchronously to each other, and can overlap each other.

    As you have seen and demonstrated, host activity and data copy or kernel activity can be overlapped with each other in a relatively straightforward fashion: kernel launches return immediately to the host, as does cudaMemcpyAsync. However, to get best overlap opportunities between data copy and kernel activity, it’s necessary to use some additional concepts. For best overlap opportunities, we need:

    1. Host memory buffers that are pinned, e.g. via cudaHostAlloc()
    2. Usage of cuda streams to separate various types of activity (data copy and kernel computation)
    3. Usage of cudaMemcpyAsync (instead of cudaMemcpy)

    Naturally your work also needs to be broken up in a separable way. This normally means that if your kernel is performing a specific function, you may need multiple invocations of this kernel so that each invocation can be working on a separate piece of data. This allows us to copy data block B to the device while the first kernel invocation is working on data block A, for example. In so doing we have the opportunity to overlap the copy of data block B with the kernel processing of data block A.

    The main differences with cudaMemcpyAsync (as compared to cudaMemcpy) are that:

    1. It can be issued in any stream (it takes a stream parameter)
    2. Normally, it returns control to the host immediately (just like a kernel call does) rather than waiting for the data copy to be completed.

    Item 1 is a necessary feature so that data copy can be overlapped with kernel computation. Item 2 is a necessary feature so that data copy can be overlapped with host activity.

    Although the concepts of copy/compute overlap are pretty straightforward, in practice the implementation requires some work. For additional references, please refer to:

    1. Overlap copy/compute section of the CUDA best practices guide.
    2. Sample code showing a basic implementation of copy/compute overlap.
    3. Sample code showing a full multi/concurrent kernel copy/compute overlap scenario.

    Note that some of the above discussion is predicated on having a compute capability 2.0 or greater device (e.g. concurrent kernels). Also, different devices may have one or 2 copy engines, meaning simultaneous copy to the device and copy from the device is only possible on certain devices.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

There have been some questions on SO recently on ADL that have got me
I have some questions about importing data from Excel/CSV File into SQL Server. Let
I have some questions about using MySQLi queries, and related memory management. Suppose I
Hi I have recently started to learn C# and have some questions regarding properties.
I've been looking at some server logs using tail -f recently, and have thought
I've been working on some solutions for some questions that have been recently posted
I have recently started looking into MQs and I had some very basic questions,
I have Java experience and recently am doing some C++ coding. My question is
I have some questions concering routing with Codeigniter. What I´m doing now is the
I have some questions regarding the following css that I found: html, body {

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.