Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8508323
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 11, 20262026-06-11T03:07:52+00:00 2026-06-11T03:07:52+00:00

I’m trying to optimize my CUDA programm by using the Parallel Nsight 2.1 edition

  • 0

I’m trying to optimize my CUDA programm by using the Parallel Nsight 2.1 edition for VS 2010.

My program runs on a Windows 7 (32 bit) machine with a GTX 480 board. I have installed the CUDA 4.1 32 bit toolkit and the 301.32 driver.

One cycle in the program consits of a copy of host data to the device, execution of the kernels and copy of the results from the device to the host.

As you can see in the picture of the profiler results below, the kernels run in four different streams. The kernel in each stream rely on the data copied to the device in ‘Stream 2’. That’s why the asyncMemcpy is synchronized with the CPU before launch of the Kernels in the different streams.

enter image description here

What irritates me in the picture is the big gap between the end of the first kernel launch (at 10.5778679285) and the beginning of the kernel execution (at 10.5781500). It takes around 300 us to launch the kernel which is a huge overhead in a processing cycle of less than 1 ms.

Furthermore there is no overlapping of kernel execution and the data copy of the results back to the host, which increases the overhead even more.

Are there any obvious reasons for this behavior?

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-11T03:07:53+00:00Added an answer on June 11, 2026 at 3:07 am

    There are three issues that I can tell by the trace.

    1. Nsight CUDA Analysis adds about 1 µs per API call. You have both CUDA runtime and CUDA Driver API trace enabled. If you were to disable CUDA runtime trace I would guess that you would reduce the width by 50 µs.

    2. Since you are on GTX 480 on Windows 7 you are executing on the WDDM driver model. On WDDM the driver must make a kernel call to submit work which introduces a lot of overhead. To avoid reduce this overhead the CUDA driver buffers requests in an internal SW queue and sends the requests to the driver when the queue is full you it is flushed by a synchronize call. It is possible tu use cudaEventQuery to force the driver to flush the work but this can have other performance implications.

    3. It appears you are submitting your work to streams in a depth first manner. On compute capability 2.x and 3.0 devices you will have better results if you submit to streams in a breadth first manner. In your case you may see overlap between your kernels.

    The timeline screenshot does not provide sufficient information for me to determine why the memory copies are starting after completion of all of the kernels. Given the API call pattern I you should be able to see transfers starting after each streams completes its launch.

    If you are waiting on all streams to complete it is likely faster to do a cudaDeviceSynchronize than 4 cudaStreamSynchronize calls.

    The next version of Nsight will have additional features to help understand the SW queuing and the submission of work to the compute engine and the memory copy engine.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I am trying to understand how to use SyndicationItem to display feed which is
Basically, what I'm trying to create is a page of div tags, each has
I'm new to using the Perl treebuilder module for HTML parsing and can't figure
link Im having trouble converting the html entites into html characters, (&# 8217;) i
That's pretty much it. I'm using Nokogiri to scrape a web page what has
I have a string like this: La Torre Eiffel paragonata all’Everest What PHP function
I am reading a book about Javascript and jQuery and using one of the
I am trying to render a haml file in a javascript response like so:
I'm using v2.0 of ClassTextile.php, with the following call: $testimonial_text = $textile->TextileRestricted($_POST['testimonial']); ... and
I'm parsing an RSS feed that has an ’ in it. SimpleXML turns this

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.