The Archive Base

Editorial Team
Asked: June 15, 2026 (2026-06-15T08:13:19+00:00)

When running concurrent copy & kernel operations: If I have a kernel runTime that

When running concurrent copy & kernel operations:
If I have a kernel run time that is twice as long as a data copy operation, will I get 2 copies per kernel run?
The stream examples I’m seeing show a 1:1 relationship (time of copy = time of kernel run). I’m wondering what happens when the ratio is different. Is there always at most one copy operation for every kernel launch? Or does the copy operation run independently of the kernel launch? That is, could I complete 5 copy operations for every kernel launch, if the run and copy times work out that way?
(I’m trying to figure out how many copy operations to queue up before a kernel launch.)

One to one (time to copy = kernel run time):
<--stream1Copy--><--stream2Copy-->
.................<-stream1Kernel->

Two to one (time to copy = 1/2 kernel run time):
<-stream1Copy-><-stream2Copy-><-stream3Copy->
...............<-------stream1Kernel-------->

1 Answer

  1. Editorial Team
    Added an answer on June 15, 2026 at 8:13 am (2026-06-15T08:13:19+00:00)

    You can have more than one copy per kernel launch. Only one copy (per direction, on devices with dual copy engines) can be running to a particular GPU at any given time, but as soon as that copy completes, another can start immediately. Asynchronous copies issued in streams other than the kernel's stream run completely asynchronously to that kernel launch, assuming neither stream is stream 0 (the default stream). This also assumes you are using pinned memory, i.e. host-side buffers allocated with cudaHostAlloc.
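    As a rough illustration of the point above, here is a minimal sketch that launches one kernel in its own stream and queues several host-to-device copies in a second stream. The names busyKernel, runOverlapDemo and CHECK, the chunk size, and the iteration count are all made up for this example. How many copies actually complete while the kernel runs depends on your hardware and on the real copy and kernel times, so treat this as a starting point rather than a measurement.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Return 1 from the enclosing function if a CUDA runtime call fails.
// (A sketch: it does not free resources on the error path.)
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s\n", cudaGetErrorString(err_));   \
            return 1;                                            \
        }                                                        \
    } while (0)

// Hypothetical kernel that adds 1.0f to each element `iters` times.
__global__ void busyKernel(float *data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k) v += 1.0f;
        data[i] = v;
    }
}

// Queues one kernel plus numCopies async copies in a separate stream.
// Returns 0 if every copy and the kernel result check out.
int runOverlapDemo(int numCopies) {
    const int n = 1 << 20;                        // elements per chunk
    const size_t chunkBytes = n * sizeof(float);
    const size_t total = (size_t)numCopies * n;   // elements in all chunks

    float *hostIn = nullptr, *devIn = nullptr, *devWork = nullptr;
    // Pinned host memory is required for copies to overlap with kernels.
    CHECK(cudaHostAlloc((void **)&hostIn, total * sizeof(float), cudaHostAllocDefault));
    CHECK(cudaMalloc((void **)&devIn, total * sizeof(float)));
    CHECK(cudaMalloc((void **)&devWork, chunkBytes));
    for (size_t i = 0; i < total; ++i) hostIn[i] = (float)(i % 1024);
    CHECK(cudaMemset(devWork, 0, chunkBytes));

    // Two non-default streams: one for the kernel, one for the copies.
    cudaStream_t kernelStream, copyStream;
    CHECK(cudaStreamCreate(&kernelStream));
    CHECK(cudaStreamCreate(&copyStream));

    busyKernel<<<(n + 255) / 256, 256, 0, kernelStream>>>(devWork, n, 1000);
    CHECK(cudaGetLastError());

    // The copies run back to back in copyStream, independent of the kernel.
    for (int c = 0; c < numCopies; ++c) {
        CHECK(cudaMemcpyAsync(devIn + (size_t)c * n, hostIn + (size_t)c * n,
                              chunkBytes, cudaMemcpyHostToDevice, copyStream));
    }
    CHECK(cudaStreamSynchronize(kernelStream));
    CHECK(cudaStreamSynchronize(copyStream));

    // Spot-check the last element of each chunk and one kernel output.
    int mismatches = 0;
    float probe = 0.0f;
    for (int c = 0; c < numCopies; ++c) {
        size_t last = (size_t)(c + 1) * n - 1;
        CHECK(cudaMemcpy(&probe, devIn + last, sizeof(float), cudaMemcpyDeviceToHost));
        if (probe != hostIn[last]) ++mismatches;
    }
    CHECK(cudaMemcpy(&probe, devWork, sizeof(float), cudaMemcpyDeviceToHost));
    if (probe != 1000.0f) ++mismatches;

    CHECK(cudaStreamDestroy(kernelStream));
    CHECK(cudaStreamDestroy(copyStream));
    CHECK(cudaFree(devIn));
    CHECK(cudaFree(devWork));
    CHECK(cudaFreeHost(hostIn));
    return mismatches;
}

int main() {
    int rc = runOverlapDemo(5);
    printf(rc == 0 ? "ok\n" : "failed\n");
    return rc;
}
```

    Nothing in the code ties the number of copies to the kernel launch. The copy stream simply drains its queue one transfer at a time while the kernel runs in the other stream.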

    You may want to read the relevant section of the CUDA best practices guide, which covers overlapping asynchronous transfers with computation.

    The reason you frequently see a 1:1 analysis of compute and copy is that the copied data is assumed to be consumed by (or produced by) the kernel call, so it is natural to think of each block of data that way. But if it’s easier to structure your code as a sequence of copies, there should be no problem with that. Naturally, if you can batch all your data into a single cudaMemcpy call, that will be slightly more efficient than a sequence of copies transferring the same data.
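    If you want to check the batching claim on your own hardware, a sketch along these lines times one large copy against a sequence of smaller copies of the same total size, using CUDA events. The function name timeCopies, the CHECK macro, and the 1 MiB chunk size are assumptions for illustration, and the numbers will vary with GPU, driver, and transfer size.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Return 1 from the enclosing function if a CUDA runtime call fails.
// (A sketch: it does not free resources on the error path.)
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s\n", cudaGetErrorString(err_));   \
            return 1;                                            \
        }                                                        \
    } while (0)

// Times one batched host-to-device copy against numChunks smaller copies
// of the same total size. Results are written in milliseconds.
int timeCopies(int numChunks, float *batchedMs, float *chunkedMs) {
    const size_t chunkBytes = 1 << 20;  // 1 MiB per chunk (arbitrary choice)
    const size_t totalBytes = chunkBytes * (size_t)numChunks;

    char *hostBuf = nullptr, *devBuf = nullptr;
    CHECK(cudaHostAlloc((void **)&hostBuf, totalBytes, cudaHostAllocDefault));
    CHECK(cudaMalloc((void **)&devBuf, totalBytes));
    memset(hostBuf, 1, totalBytes);

    cudaStream_t stream;
    cudaEvent_t start, stop;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Warm-up transfer so one-time setup cost is not charged to either run.
    CHECK(cudaMemcpyAsync(devBuf, hostBuf, totalBytes, cudaMemcpyHostToDevice, stream));
    CHECK(cudaStreamSynchronize(stream));

    // One batched copy.
    CHECK(cudaEventRecord(start, stream));
    CHECK(cudaMemcpyAsync(devBuf, hostBuf, totalBytes, cudaMemcpyHostToDevice, stream));
    CHECK(cudaEventRecord(stop, stream));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaEventElapsedTime(batchedMs, start, stop));

    // The same data as a sequence of smaller copies.
    CHECK(cudaEventRecord(start, stream));
    for (int c = 0; c < numChunks; ++c) {
        CHECK(cudaMemcpyAsync(devBuf + c * chunkBytes, hostBuf + c * chunkBytes,
                              chunkBytes, cudaMemcpyHostToDevice, stream));
    }
    CHECK(cudaEventRecord(stop, stream));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaEventElapsedTime(chunkedMs, start, stop));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(devBuf));
    CHECK(cudaFreeHost(hostBuf));
    return 0;
}

int main() {
    float batchedMs = 0.0f, chunkedMs = 0.0f;
    if (timeCopies(16, &batchedMs, &chunkedMs) != 0) return 1;
    printf("batched: %.3f ms, chunked: %.3f ms\n", batchedMs, chunkedMs);
    return 0;
}
```

    A single run like this is noisy. Repeating the measurement several times and comparing averages gives a fairer picture.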

    The visual profiler will help you see exactly what is going on comparing data copy operations to kernel operations, in a timeline fashion.
