The Archive Base

Editorial Team
Asked: May 18, 2026

There used to be profiling counters in cudaprof for global memory (gst_coherent, gst_incoherent, gld_coherent, gld_incoherent) that were useful and clear to me because they told me how many uncoalesced global reads and writes I had.

Now, there seem to be only “gst requests” and “gld requests”. These are the total loads/stores per warp on multiprocessor 0. How do I determine if I have uncoalesced reads/writes? I’m guessing that there would be fewer requests if the accesses were coalesced. Am I supposed to figure out how many I expect per thread and compare? Unfortunately, my kernel is too dynamic for that.

1 Answer

  1. Editorial Team
    Added an answer on May 18, 2026 at 7:11 am

    The coherent/incoherent counters are relevant on sm_10/sm_11 devices, where accesses have to be aligned and coalesced to avoid pathological performance. On sm_12 and sm_13 the hardware attempts to coalesce accesses into segment transactions wherever possible, and on sm_2x the L1 cache provides a similar function, with the additional benefit of caching when coalescing is not possible.

    Ideally you would have a feel for how much data you are reading and writing, and would compare this with the achieved performance; this gives you an idea of the efficiency. However, given that your kernel is very data-dependent, you should take a look at a couple of the presentations from GTC 2010 to understand the other information that is available in the profiler. I’d recommend the Fundamental Performance Optimizations for GPUs talk and, more importantly but following on from the first one, the Analysis-Driven Performance Optimization talk.
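The comparison of data volume against achieved performance can be sketched as a small host-side calculation. This is only an illustration: the function names are hypothetical, the byte counts are whatever you estimate your kernel must move, the elapsed time is assumed to come from your own timing (for example CUDA event timers, which report milliseconds), and the peak bandwidth is a figure you look up for your device.

```cpp
#include <cassert>
#include <cstddef>

// Effective bandwidth in GB/s: the bytes the kernel has to read and write,
// divided by the measured kernel time. (Hypothetical helper, not a CUDA API.)
double effective_bandwidth_gbs(std::size_t bytes_read,
                               std::size_t bytes_written,
                               double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    const double total_bytes =
        static_cast<double>(bytes_read) + static_cast<double>(bytes_written);
    const double seconds = elapsed_ms * 1.0e-3;
    return total_bytes / seconds / 1.0e9;
}

// Fraction of the device's peak memory bandwidth that was achieved.
// peak_gbs is an assumed value taken from the device specification.
double bandwidth_efficiency(double effective_gbs, double peak_gbs) {
    assert(peak_gbs > 0.0);
    return effective_gbs / peak_gbs;
}
```

A low fraction on a kernel that should be memory-bound is a hint, not proof, that accesses are poorly coalesced; other bottlenecks can produce the same number.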

    You could also consider instrumenting your code manually with a few extra counters.
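As a rough idea of what such counters might track, here is a host-side model of the bookkeeping, written in plain C++ so the logic can be checked without a GPU. It is a sketch under assumptions, not the profiler's method: `AccessCounters` and `record_warp_access` are hypothetical names, the 128-byte segment size is an assumed default, and each access is assumed to fall entirely within one segment. Real hardware rules vary by compute capability, and in a kernel the counters would live in global memory and be updated with something like `atomicAdd`.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical counters an instrumented kernel might accumulate.
struct AccessCounters {
    std::uint64_t requests = 0;      // one per warp-level load or store
    std::uint64_t transactions = 0;  // distinct segments those requests touched
};

// Record one warp-level access. `addresses` holds the byte address used by
// each active thread; `segment_bytes` is an assumed segment size.
void record_warp_access(AccessCounters& counters,
                        const std::vector<std::uint64_t>& addresses,
                        std::uint64_t segment_bytes = 128) {
    assert(segment_bytes > 0);
    if (addresses.empty()) {
        return;  // no active threads, nothing to count
    }
    std::set<std::uint64_t> segments;
    for (std::uint64_t address : addresses) {
        segments.insert(address / segment_bytes);  // segment index of this access
    }
    counters.requests += 1;
    counters.transactions += segments.size();
}

// Average number of segments touched per request; values close to 1 suggest
// well-coalesced accesses under this simplified model.
double transactions_per_request(const AccessCounters& counters) {
    if (counters.requests == 0) {
        return 0.0;
    }
    return static_cast<double>(counters.transactions) /
           static_cast<double>(counters.requests);
}
```

Because the ratio is computed from the addresses the kernel actually generates, it does not require predicting the access pattern in advance, which is the difficulty with a very dynamic kernel.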
