Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8987341
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 15, 20262026-06-15T21:46:02+00:00 2026-06-15T21:46:02+00:00

I have read many times about CUDA Thread/Blocks and Array, but still don’t understand

  • 0

I have read many times about CUDA Thread/Blocks and Array, but still don’t understand point: how and when CUDA starts to run multithread for kernel function. when host calling kernel function, or inside kernel function.

For example I have this example, It just simple transpose an array. (so, it just copy value from this array to another array).

__global__
void transpose(float* in, float* out, uint width) {
    uint tx = blockIdx.x * blockDim.x + threadIdx.x;
    uint ty = blockIdx.y * blockDim.y + threadIdx.y;
    out[tx * width + ty] = in[ty * width + tx];
}

int main(int args, char** vargs) {
    /*const int HEIGHT = 1024;
    const int WIDTH = 1024;
    const int SIZE = WIDTH * HEIGHT * sizeof(float);
    dim3 bDim(16, 16);
    dim3 gDim(WIDTH / bDim.x, HEIGHT / bDim.y);
    float* M = (float*)malloc(SIZE);
    for (int i = 0; i < HEIGHT * WIDTH; i++) { M[i] = i; }
    float* Md = NULL;
    cudaMalloc((void**)&Md, SIZE);
    cudaMemcpy(Md,M, SIZE, cudaMemcpyHostToDevice);
    float* Bd = NULL;
    cudaMalloc((void**)&Bd, SIZE); */
    transpose<<<gDim, bDim>>>(Md, Bd, WIDTH);   // CALLING FUNCTION TRANSPOSE
    cudaMemcpy(M,Bd, SIZE, cudaMemcpyDeviceToHost);
    return 0;
}

(I have commented all lines that not important, just have the line calling function transpose)

I have understand all lines in function main except the line calling function tranpose. Does it true when I say: when we call function transpose<<<gDim, bDim>>>(Md, Bd, WIDTH), CUDA will automatically assign each elements of array into one thread (and block), and when we calling “one time” tranpose, CUDA will running gDim * bDim times tranpose on gDim * bDim threads.

This point makes me feel frustrated so much, because it doesn’t like multithread in java, when I use 🙁 Please tell me.

Thanks 🙂

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-15T21:46:03+00:00Added an answer on June 15, 2026 at 9:46 pm

    Your understanding is in essence correct.

    transpose is not a function, but a CUDA kernel. When you call a regular function, it only runs once. But when you launch a kernel a single time, CUDA will automatically run the code in the kernel many times. CUDA does this by starting many threads. Each thread runs the code in your kernel one time. The numbers inside the tripple brackets (<<< >>>) is called the kernel execution configuration. It determines how many threads will be launched by CUDA and specifies some relationships between the threads.

    The number of threads that will be started is calculated by multiplying up all the values in the grid and block dimensions inside the triple brackets. For instance, the number of threads will be 1,048,576 (16 * 16 * 64 * 64) in your example.

    Each thread can read some variables to find out which thread it is. Those are the blockIdx and threadIdx structures at the top of the kernel. The values reflect the ones in the kernel execution configuration. So, if you run your kernel with a grid configuration of 16 x 16 (the first dim3 in the triple brackets, you will get threads that, when they each read the x and y values in the blockIdx structure, will get all possible combinations of x and y between 0 and 15.

    So, as you see, CUDA does not know anything about array elements or any other data structures that are specific to your kernel. It just deals with threads, thread indexes and block indexes. You then use those indexes to to determine what a given thread should do (in particular, which values in your application specific data it should work on).

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have read many questions about Android, J2ME and RecordStore , but I still
I have read about DFS and BFS many times but I have this doubt
I have read many questions about the facebook login but until not I didnt
1) I have read many times about undefined behavior in C. That is about
I have read so many times, here and everywhere on the net, that mutexes
I have read many answers to questions about dynamically resizing NSWindows and nothing has
I have read many answers regarding this still i am getting confused if i
I have read many question about improving the performance of C++ and C code
I am trying understand ViewModels deeper and I have read many articles and blogs
Alright...I've given the site a fair search and have read over many posts about

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.