The Archive Base

Editorial Team
Asked: June 8, 2026

I ran the visual profiler on a CUDA application of mine. The application calls a single kernel multiple times if the data is too large. This kernel has no branching.

The profiler reports a high instruction replay overhead of 83.6% and a high global memory instruction replay overhead of 83.5%.

Here is how the kernel generally looks:

// Decryption kernel
__global__ void dev_decrypt(uint8_t *in_blk, uint8_t *out_blk){

    __shared__ volatile word sdata[256];
    register uint32_t data;

    // Thread ID
#define xID (threadIdx.x + blockIdx.x * blockDim.x)
#define yID (threadIdx.y + blockIdx.y * blockDim.y)
    uint32_t tid = xID + yID * blockDim.x * gridDim.x;
#undef xID
#undef yID

    register uint32_t pos4 = tid%4;
    register uint32_t pos256 = tid%256;
    uint32_t blk = pos256&0xFC;

    // Indices
    register uint32_t index0 = blk + (pos4+3)%4;
    register uint32_t index1 = blk + (pos4+2)%4;

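    // Read from global memory, offset by 4 words (16 bytes).
    // Note: b0 is not declared in this excerpt; it presumably aliases sdata.
    b0[pos256] = ((word*)in_blk)[tid+4] ^ dev_key[pos4];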

    data  = tab(0,sdata[index0]);
    data ^= tab(1,sdata[index1]);
    sdata[pos256] = data ^ tab2[pos4];

    data  = tab(0,sdata[index0]);
    data ^= tab(1,sdata[index1]);
    sdata[pos256] = data ^ tab2[2*pos4];

    data  = tab(0,sdata[index0]);
    data ^= tab(1,sdata[index1]);
    data ^= tab2[3*pos4];

    ((uint32_t*)out_blk)[tid] = data + ((uint32_t*)in_blk)[tid];
}

As you can see there are no branches. The threads will initially read from global memory based on thread ID + 16 bytes. They will then write to an output buffer after performing an operation with data from global memory based on their thread ID.

Any ideas why this kernel would have so much overhead?

1 Answer

  1. Editorial Team
    Added an answer on June 8, 2026 at 12:41 pm

    The instruction replay in this case comes from non-uniform constant memory access within a warp. In your code, tab is stored in constant memory and indexed by some combination of the thread index and data stored in shared memory. The result would appear to be that threads within the same warp access different words. Constant memory is really intended for cases where all threads in a warp access the same word; the value can then be broadcast from the constant memory cache in a single operation. Otherwise, warp serialization occurs.
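
To make the two access patterns concrete, here is a minimal sketch that contrasts them. The table `c_table`, both kernels, and the host helper `run_constant_demo` are hypothetical names invented for illustration; they are not taken from the question's code, and `c_table` only stands in for a table like `tab`.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical 256-entry lookup table in constant memory (stand-in for `tab`).
__constant__ uint32_t c_table[256];

// Uniform access: every thread in a warp reads the same word, so the
// constant cache can broadcast it to the whole warp.
__global__ void uniform_lookup(uint32_t *out, uint32_t key) {
    uint32_t tid = threadIdx.x + blockIdx.x * blockDim.x;
    out[tid] = c_table[key & 0xFF] ^ tid;
}

// Non-uniform access: the index depends on per-thread data, so threads in
// the same warp generally read different words of the table.
__global__ void divergent_lookup(const uint32_t *in, uint32_t *out) {
    uint32_t tid = threadIdx.x + blockIdx.x * blockDim.x;
    out[tid] = c_table[in[tid] & 0xFF];
}

// Runs both kernels and returns the number of mismatches against a host
// reference (0 on success, -1 if a CUDA call failed).
int run_constant_demo() {
    const int n = 256;
    uint32_t h_table[256], h_in[n], h_out[n];
    for (int i = 0; i < 256; ++i) h_table[i] = i * 2654435761u;
    for (int i = 0; i < n; ++i) h_in[i] = (i * 7u) & 0xFF;

    uint32_t *d_in = nullptr, *d_out = nullptr;
    if (cudaMemcpyToSymbol(c_table, h_table, sizeof(h_table)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return -1;
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    int errors = 0;

    uniform_lookup<<<n / 64, 64>>>(d_out, 5);
    if (cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost) != cudaSuccess) return -1;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != (h_table[5] ^ (uint32_t)i)) ++errors;

    divergent_lookup<<<n / 64, 64>>>(d_in, d_out);
    if (cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost) != cudaSuccess) return -1;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_table[h_in[i]]) ++errors;

    cudaFree(d_in);
    cudaFree(d_out);
    return errors;
}
```

Both kernels produce correct results; the difference is only in how the constant memory reads are serviced. Profiling the two launches separately should show whether the second pattern accounts for the replay overhead on a given device.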

    In cases where non-uniform access to small, read-only datasets is required, it would probably be better to bind the data to a texture than to store it in constant memory.
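
One way to try the texture approach is sketched below. It assumes the texture object API (compute capability 3.0 or later) rather than whatever binding mechanism the original code base would use, and every name in it (`texture_lookup`, `run_texture_demo`, the table contents) is hypothetical.

```cuda
#include <cstdint>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread fetches a table entry chosen by its own input word, reading
// through the texture path instead of constant memory.
__global__ void texture_lookup(cudaTextureObject_t table,
                               const uint32_t *in, uint32_t *out, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n) {
        out[tid] = tex1Dfetch<unsigned int>(table, in[tid] & 0xFF);
    }
}

// Builds a 256-entry table in linear device memory, wraps it in a texture
// object, runs the kernel, and returns the number of mismatches against a
// host reference (0 on success, -1 if a CUDA call failed).
int run_texture_demo() {
    const int n = 1024;
    std::vector<uint32_t> h_table(256), h_in(n), h_out(n);
    for (int i = 0; i < 256; ++i) h_table[i] = 0x9E3779B9u * (i + 1);
    for (int i = 0; i < n; ++i) h_in[i] = (i * 37u) & 0xFF;

    uint32_t *d_table = nullptr, *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_table, 256 * sizeof(uint32_t)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_in, n * sizeof(uint32_t)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(uint32_t)) != cudaSuccess) return -1;
    cudaMemcpy(d_table, h_table.data(), 256 * sizeof(uint32_t), cudaMemcpyHostToDevice);
    cudaMemcpy(d_in, h_in.data(), n * sizeof(uint32_t), cudaMemcpyHostToDevice);

    // Describe the linear buffer that backs the texture.
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = d_table;
    res.res.linear.desc = cudaCreateChannelDesc<unsigned int>();
    res.res.linear.sizeInBytes = 256 * sizeof(uint32_t);

    // Return raw element values (no normalization or filtering).
    cudaTextureDesc tex = {};
    tex.readMode = cudaReadModeElementType;

    cudaTextureObject_t table = 0;
    if (cudaCreateTextureObject(&table, &res, &tex, nullptr) != cudaSuccess) return -1;

    texture_lookup<<<(n + 255) / 256, 256>>>(table, d_in, d_out, n);
    if (cudaMemcpy(h_out.data(), d_out, n * sizeof(uint32_t),
                   cudaMemcpyDeviceToHost) != cudaSuccess) return -1;

    int errors = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_table[h_in[i]]) ++errors;

    cudaDestroyTextureObject(table);
    cudaFree(d_table);
    cudaFree(d_in);
    cudaFree(d_out);
    return errors;
}
```

The table is passed to the kernel as an argument, so the same kernel can be pointed at different tables without recompiling. Whether this actually reduces replay for the kernel in the question would need to be confirmed with the profiler.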
