This is a performance-related question. I’ve written the following simple CUDA kernel based on


Asked: June 14, 2026

This is a performance-related question. I’ve written the following simple CUDA kernel based on the "CUDA By Example" sample code:

#define N 37426 /* the (arbitrary) number of hashes we want to calculate */
#define THREAD_COUNT 128

__device__ const unsigned char *m = "Goodbye, cruel world!";

__global__ void kernel_sha1(unsigned char *hval) {
  sha1_ctx ctx[1];
  unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

  while(tid < N) {
    sha1_begin(ctx);
    sha1_hash(m, 21UL, ctx);
    sha1_end(hval+tid*SHA1_DIGEST_SIZE, ctx);
    tid += blockDim.x * gridDim.x;
  }
}

The code seems to me to be correct and indeed spits out 37,426 copies of the same hash (as expected). Based on my reading of Chapter 5, section 5.3, I assumed that each thread writing to the global memory passed in as hval would be extremely inefficient.

I then implemented what I assumed would be a performance-boosting cache using shared memory. The code was modified as follows:

#define N 37426 /* the (arbitrary) number of hashes we want to calculate */
#define THREAD_COUNT 128

__device__ const unsigned char *m = "Goodbye, cruel world!";

__global__ void kernel_sha1(unsigned char *hval) {
  sha1_ctx ctx[1];
  unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
  __shared__ unsigned char cache[THREAD_COUNT*SHA1_DIGEST_SIZE];

  while(tid < N) {
    sha1_begin(ctx);
    sha1_hash(m, 21UL, ctx);
    sha1_end(cache+threadIdx.x*SHA1_DIGEST_SIZE, ctx);

    __syncthreads();
    if( threadIdx.x == 0) {
      memcpy(hval+tid*SHA1_DIGEST_SIZE, cache, sizeof(cache));
    }
    __syncthreads();
    tid += blockDim.x * gridDim.x;
  }
}

The second version also appears to run correctly, but is several times slower than the initial version. The second version completes in about 8.95 milliseconds, while the first runs in about 1.64 milliseconds. My question to the Stack Overflow community is simple: Why?


1 Answer

Editorial Team · Answer 1 · 2026-06-14T12:55:09+00:00

I looked through CUDA by Example and couldn’t find anything resembling this. Yes, there is some discussion of GPU hash tables in the appendix, but it looks nothing like this. So I really have no idea what your functions do, especially sha1_end. If this code is similar to something in that book, please point it out; I may have missed it.

However, if sha1_end writes to global memory once (per thread) and does so in a coalesced way, there’s no reason that it can’t be quite efficient. Presumably each thread is writing to a different location, so if they are adjacent more-or-less, there are definitely opportunities for coalescing. Without going into the details of coalescing, suffice it to say that it allows multiple threads to write data to memory in a single transaction. And if you are going to write your data to global memory, you’re going to have to pay this penalty at least once, somewhere.
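To make the per-thread write pattern concrete, here is a minimal sketch. I don't have the sha1_* functions, so stub_digest below is a hypothetical stand-in that just produces five 32-bit words (20 bytes, the size of a SHA-1 digest); run_direct_write is likewise a helper I made up for checking the result. The point is only that neighbouring threads write neighbouring 20-byte slots, so the output of a 32-thread warp lands in one contiguous 640-byte region of global memory.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define DIGEST_WORDS 5  // a SHA-1 digest is 20 bytes, i.e. five 32-bit words

// Hypothetical stand-in for the real hash: word w of digest idx is idx*5 + w.
__device__ void stub_digest(unsigned int idx, unsigned int *out) {
  for (int w = 0; w < DIGEST_WORDS; ++w) out[w] = idx * DIGEST_WORDS + w;
}

// Every thread writes its own digest straight to global memory.
// Thread tid owns words [tid*5, tid*5 + 5), right next to thread tid+1's slot.
__global__ void direct_write(unsigned int *hval, unsigned int n) {
  unsigned int stride = blockDim.x * gridDim.x;
  for (unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x; tid < n;
       tid += stride) {
    unsigned int d[DIGEST_WORDS];
    stub_digest(tid, d);
    for (int w = 0; w < DIGEST_WORDS; ++w)
      hval[(size_t)tid * DIGEST_WORDS + w] = d[w];
  }
}

// Host helper (made up for this sketch): runs the kernel and verifies that
// word i of the output equals i. Returns true on success.
bool run_direct_write(unsigned int n) {
  unsigned int *dev = nullptr;
  size_t bytes = (size_t)n * DIGEST_WORDS * sizeof(unsigned int);
  if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;
  direct_write<<<64, 128>>>(dev, n);
  std::vector<unsigned int> host((size_t)n * DIGEST_WORDS);
  cudaError_t err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
  cudaFree(dev);
  if (err != cudaSuccess) return false;
  for (size_t i = 0; i < host.size(); ++i)
    if (host[i] != (unsigned int)i) return false;
  return true;
}
```

Calling run_direct_write(37426) exercises the same N as in the question. How many memory transactions the hardware actually issues for this pattern depends on the GPU generation, which is why measuring it is better than guessing.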

For your modification, you’ve completely killed this concept. You have now performed all the data copying from a single thread, and the memcpy means that subsequent data writes (ints, or chars, whatever) are occurring in separate transactions. Yes, there is a cache which may help with this, but it’s completely the wrong way to do it on GPUs. Let each thread update global memory, and take advantage of opportunities to do it in parallel. But when you force all the updates on a single thread, then that thread must copy the data sequentially. This is probably the biggest single cost factor in the timing difference.
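If you do want to stage results in shared memory, the copy back to global memory should be shared out across the whole block rather than handed to thread 0. The following is a sketch of that idea, not a claim that it will beat the direct version for this workload. fill_digest and run_staged_write are hypothetical names of mine, and the kernel assumes it is launched with exactly THREADS threads per block. Two details are my own choices and differ from the second kernel in the question: the loop runs on the block's base index so that every thread in a block reaches each __syncthreads(), and the copy is clipped to the valid part of the last tile so that nothing is written past the end of hval.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define DIGEST_WORDS 5  // 20-byte digest as five 32-bit words
#define THREADS 128     // kernel assumes blockDim.x == THREADS

// Hypothetical stand-in for the real hash: word w of digest idx is idx*5 + w.
__device__ void fill_digest(unsigned int idx, unsigned int *out) {
  for (int w = 0; w < DIGEST_WORDS; ++w) out[w] = idx * DIGEST_WORDS + w;
}

__global__ void staged_write(unsigned int *hval, unsigned int n) {
  __shared__ unsigned int cache[THREADS * DIGEST_WORDS];
  unsigned int stride = blockDim.x * gridDim.x;
  // 'base' is identical for all threads of a block, so the loop trip count
  // is uniform and the barriers below are reached by the whole block.
  for (unsigned int base = blockIdx.x * blockDim.x; base < n; base += stride) {
    unsigned int tid = base + threadIdx.x;
    if (tid < n) fill_digest(tid, &cache[threadIdx.x * DIGEST_WORDS]);
    __syncthreads();  // the tile's digests are now in shared memory

    // The last tile may be partial: only copy the words that were produced.
    unsigned int remaining = n - base;
    unsigned int rows = remaining < blockDim.x ? remaining : blockDim.x;
    unsigned int valid = rows * DIGEST_WORDS;
    // Cooperative copy: in each pass, consecutive threads write consecutive
    // 32-bit words of global memory.
    for (unsigned int i = threadIdx.x; i < valid; i += blockDim.x)
      hval[(size_t)base * DIGEST_WORDS + i] = cache[i];
    __syncthreads();  // nobody overwrites cache until the copy is finished
  }
}

// Host helper (made up for this sketch): verifies word i of the output is i.
bool run_staged_write(unsigned int n) {
  unsigned int *dev = nullptr;
  size_t bytes = (size_t)n * DIGEST_WORDS * sizeof(unsigned int);
  if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;
  staged_write<<<64, THREADS>>>(dev, n);
  std::vector<unsigned int> host((size_t)n * DIGEST_WORDS);
  cudaError_t err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
  cudaFree(dev);
  if (err != cudaSuccess) return false;
  for (size_t i = 0; i < host.size(); ++i)
    if (host[i] != (unsigned int)i) return false;
  return true;
}
```

Note that this version still pays for two barriers per tile plus the extra trip through shared memory, so it is worth timing it against the direct version before assuming it helps.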

The use of __syncthreads() also imposes additional cost.

Section 12.2.7 of the CUDA by Example book refers to the Visual Profiler (and mentions that it can gather information about coalesced accesses). The Visual Profiler is a good tool for answering questions like this.
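Separately from the profiler, timings like the ones quoted in the question can be taken with CUDA events. This is a different technique from the Visual Profiler: it only gives elapsed time, not coalescing statistics. The sketch below is generic; scale_kernel and time_kernel_ms are hypothetical names, and the kernel is a trivial placeholder where your own kernel launch would go.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel; substitute the kernel you actually want to time.
__global__ void scale_kernel(float *data, unsigned int n) {
  unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
  if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Returns the elapsed time of one launch in milliseconds,
// or a negative value if something went wrong.
float time_kernel_ms(unsigned int n) {
  float *dev = nullptr;
  if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return -1.0f;
  cudaMemset(dev, 0, n * sizeof(float));

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);                        // marker before the launch
  scale_kernel<<<(n + 127) / 128, 128>>>(dev, n);
  cudaEventRecord(stop);                         // marker after the launch
  cudaEventSynchronize(stop);                    // wait until 'stop' has occurred

  float ms = -1.0f;
  if (cudaGetLastError() == cudaSuccess)
    cudaEventElapsedTime(&ms, start, stop);      // time between the two markers

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(dev);
  return ms;
}
```

The first launch in a process often includes one-time startup cost, so it is common to run the kernel once untimed and then average several timed launches.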

If you want to learn more about efficient memory techniques and coalescing, I would recommend the NVIDIA GPU computing webinar entitled “GPU Computing using CUDA C – Advanced 1 (2010)”. The direct link to it is here with slides.
