The Archive Base

Editorial Team
Asked: June 14, 2026


This is a performance-related question. I’ve written the following simple CUDA kernel based on the "CUDA By Example" sample code:

#define N 37426 /* the (arbitrary) number of hashes we want to calculate */
#define THREAD_COUNT 128

__device__ const unsigned char *m = "Goodbye, cruel world!";

__global__ void kernel_sha1(unsigned char *hval) {
  sha1_ctx ctx[1];
  unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

  while(tid < N) {
    sha1_begin(ctx);
    sha1_hash(m, 21UL, ctx);
    sha1_end(hval+tid*SHA1_DIGEST_SIZE, ctx);
    tid += blockDim.x * gridDim.x;
  }
}

The code seems correct to me and does indeed spit out 37,426 copies of the same hash (as expected). Based on my reading of Chapter 5, section 5.3, I assumed that having each thread write directly to the global memory passed in as hval would be extremely inefficient.

I then implemented what I assumed would be a performance-boosting cache using shared memory. The code was modified as follows:

#define N 37426 /* the (arbitrary) number of hashes we want to calculate */
#define THREAD_COUNT 128

__device__ const unsigned char *m = "Goodbye, cruel world!";

__global__ void kernel_sha1(unsigned char *hval) {
  sha1_ctx ctx[1];
  unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
  __shared__ unsigned char cache[THREAD_COUNT*SHA1_DIGEST_SIZE];

  while(tid < N) {
    sha1_begin(ctx);
    sha1_hash(m, 21UL, ctx);
    sha1_end(cache+threadIdx.x*SHA1_DIGEST_SIZE, ctx);

    __syncthreads();
    if( threadIdx.x == 0) {
      memcpy(hval+tid*SHA1_DIGEST_SIZE, cache, sizeof(cache));
    }
    __syncthreads();
    tid += blockDim.x * gridDim.x;
  }
}

The second version also appears to run correctly, but is several times slower than the initial version. The second version completes in about 8.95 milliseconds, while the first runs in about 1.64 milliseconds. My question to the Stack Overflow community is simple: Why?

1 Answer

    Editorial Team
    Added an answer on June 14, 2026 at 12:55 pm

    I looked through CUDA by Example and couldn't find anything resembling this. Yes, there is some discussion of GPU hash tables in the appendix, but it looks nothing like this. So I really have no idea what your functions do, especially sha1_end. If this code is similar to something in that book, please point it out; I may have missed it.

    However, if sha1_end writes to global memory once (per thread) and does so in a coalesced way, there's no reason that it can't be quite efficient. Presumably each thread is writing to a different location, so if those locations are more or less adjacent, there are definitely opportunities for coalescing. Without going into the details of coalescing, suffice it to say that it allows multiple threads to write data to memory in a single transaction. And if you are going to write your data to global memory, you're going to have to pay this cost at least once, somewhere.

    Your modification completely defeats this. All the data copying is now performed by a single thread, and the memcpy means that the successive data writes (ints, or chars, whatever) occur in separate transactions. Yes, there is a cache which may help with this, but it's completely the wrong way to do it on GPUs. Let each thread update global memory, and take advantage of opportunities to do it in parallel. When you force all the updates onto a single thread, that thread must copy the data sequentially. This is probably the biggest single cost factor in the timing difference.
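If staging through shared memory is still wanted, the usual pattern is to let every thread in the block take part in the copy out to global memory, rather than thread 0 alone. The following is only a minimal sketch of that idea, not the asker's actual code: fake_digest is a hypothetical stand-in for the sha1_begin/sha1_hash/sha1_end calls (whose implementation is not shown in the question), DIGEST_SIZE assumes the 20-byte SHA-1 digest, and the launch of 64 blocks is an arbitrary choice. It also loops on the block's base index, because in the question's staged kernel threads whose tid passes N appear to leave the loop before the others reach __syncthreads(), and thread 0 copies a full sizeof(cache) even when the final block is only partly filled.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 37426
#define THREAD_COUNT 128
#define DIGEST_SIZE 20 /* assumption: SHA-1 digests are 20 bytes */

/* Hypothetical stand-in for sha1_begin/sha1_hash/sha1_end. */
__device__ void fake_digest(unsigned char *out, unsigned int seed) {
  for (int i = 0; i < DIGEST_SIZE; ++i) out[i] = (unsigned char)(seed + i);
}

/* Must be launched with THREAD_COUNT threads per block. */
__global__ void kernel_staged(unsigned char *hval) {
  __shared__ unsigned char cache[THREAD_COUNT * DIGEST_SIZE];

  /* Loop on the block's base index, not the thread's, so every thread in
     the block runs the same number of passes and reaches each barrier. */
  for (unsigned int base = blockIdx.x * blockDim.x; base < N;
       base += blockDim.x * gridDim.x) {
    unsigned int tid = base + threadIdx.x;
    if (tid < N) fake_digest(cache + threadIdx.x * DIGEST_SIZE, tid);
    __syncthreads();

    /* Bytes staged in this pass; the last pass may be partial. */
    unsigned int remaining = N - base;
    unsigned int items = remaining < blockDim.x ? remaining : blockDim.x;
    unsigned int count = items * DIGEST_SIZE;

    /* Every thread copies: thread k handles bytes k, k+128, k+256, ...,
       so neighbouring threads write neighbouring addresses. */
    for (unsigned int i = threadIdx.x; i < count; i += blockDim.x)
      hval[base * DIGEST_SIZE + i] = cache[i];
    __syncthreads();
  }
}

/* Runs the kernel and checks every byte on the host; returns 0 on success. */
int run_staged_check(void) {
  const size_t bytes = (size_t)N * DIGEST_SIZE;
  unsigned char *d_hval = NULL;
  unsigned char *h_hval = (unsigned char *)malloc(bytes);
  if (h_hval == NULL) return 1;
  if (cudaMalloc(&d_hval, bytes) != cudaSuccess) { free(h_hval); return 1; }

  kernel_staged<<<64, THREAD_COUNT>>>(d_hval);
  int bad = (cudaMemcpy(h_hval, d_hval, bytes,
                        cudaMemcpyDeviceToHost) != cudaSuccess);

  for (size_t t = 0; !bad && t < N; ++t)
    for (int i = 0; i < DIGEST_SIZE; ++i)
      if (h_hval[t * DIGEST_SIZE + i] != (unsigned char)(t + i)) bad = 1;

  cudaFree(d_hval);
  free(h_hval);
  return bad;
}

int main(void) {
  int bad = run_staged_check();
  printf("%s\n", bad ? "mismatch" : "ok");
  return bad;
}
```

Whether this beats the original direct-write kernel is something to measure rather than assume: the staging adds two barriers per pass, and the byte-wide copies here are the simplest form of the pattern, not the fastest.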

    The use of __syncthreads() also imposes additional cost.

    Section 12.2.7 of CUDA by Example refers to the Visual Profiler (and mentions that it can gather information about coalesced accesses). The Visual Profiler is a good tool for answering questions like this.
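Alongside the profiler, timings like the 1.64 ms and 8.95 ms figures in the question can be reproduced with CUDA events. The sketch below is an assumption about how such a measurement might be set up, not the asker's harness: kernel_under_test is a hypothetical placeholder to be replaced by either version of kernel_sha1, and the buffer size and launch configuration are arbitrary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

/* Hypothetical placeholder; substitute either version of kernel_sha1. */
__global__ void kernel_under_test(unsigned char *out, unsigned int n) {
  for (unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x; tid < n;
       tid += blockDim.x * gridDim.x)
    out[tid] = (unsigned char)tid;
}

/* Times one launch with CUDA events; returns milliseconds, or -1 on error. */
float time_launch_ms(unsigned int n, int blocks, int threads) {
  unsigned char *d_out = NULL;
  cudaEvent_t start, stop;
  float ms = -1.0f;

  if (cudaMalloc(&d_out, n) != cudaSuccess) return ms;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  /* Warm-up launch so one-time initialization is not part of the timing. */
  kernel_under_test<<<blocks, threads>>>(d_out, n);
  cudaDeviceSynchronize();

  cudaEventRecord(start, 0);
  kernel_under_test<<<blocks, threads>>>(d_out, n);
  cudaEventRecord(stop, 0);
  cudaEventSynchronize(stop); /* wait for the kernel and the stop event */

  if (cudaGetLastError() == cudaSuccess)
    cudaEventElapsedTime(&ms, start, stop);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(d_out);
  return ms;
}

int main(void) {
  float ms = time_launch_ms(37426u * 20u, 64, 128);
  printf("elapsed: %.3f ms\n", ms);
  return ms < 0.0f;
}
```

A single timed launch is noisy; averaging several launches gives a steadier comparison between the two kernels.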

    If you want to learn more about efficient memory techniques and coalescing, I would recommend the NVIDIA GPU computing webinar entitled “GPU Computing using CUDA C – Advanced 1 (2010)”. The direct link to it is here with slides.
