The Archive Base

Editorial Team
Asked: May 23, 2026

I have CUDA 2.1 installed on my machine and it has a graphic card

I have CUDA 2.1 installed on my machine, which has a graphics card with 64 CUDA cores.
I have written a program that launches 30000 blocks at once, with 1 thread per block. The results from the GPU are disappointing: it runs more slowly than the CPU.

Must the number of blocks be smaller than or equal to the number of cores for good performance? Or is performance unrelated to the number of blocks?

1 Answer

  1. Editorial Team
    Added an answer on May 23, 2026 at 3:52 am

    CUDA cores are not what you would call a core on a classical CPU. They are better viewed as ALUs (arithmetic logic units), which can only execute operations whose operands are already available.

    Threads are handled in warps (groups of 32 threads) within the blocks you define. When your blocks are dispatched to the SMs (Streaming Multiprocessors, which are the actual cores of the GPU), each SM schedules the warps of a block so that computation on some warps overlaps with the memory accesses that fetch the input data of others.

    The problem is that a thread is always handled as part of its warp. With only one thread per block, each block occupies a whole warp in which a single lane does useful work, so the SM has no other warps in the block to switch to and you cannot take advantage of the multiple CUDA cores available. Your CUDA cores will sit idle waiting for data, since they compute far faster than data can be retrieved from memory.
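    To put a number on this, here is a minimal host-side sketch of the warp arithmetic. It assumes a warp size of 32, as stated above; the names `kWarpSize`, `warps_per_block` and `lane_utilization` are made up for illustration and are not part of the CUDA API.

```cpp
// Warp size assumed to be 32 threads, as described in the answer.
constexpr int kWarpSize = 32;

// Number of warps needed to hold one block: the thread count
// rounded up to a whole number of warps.
constexpr int warps_per_block(int threads_per_block) {
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Fraction of the allocated warp lanes that hold a real thread (0 < f <= 1).
constexpr double lane_utilization(int threads_per_block) {
    return static_cast<double>(threads_per_block) /
           (warps_per_block(threads_per_block) * kWarpSize);
}

// Compile-time sanity checks for the two extreme cases.
static_assert(warps_per_block(1) == 1, "a 1-thread block still takes a full warp");
static_assert(warps_per_block(256) == 8, "256 threads fill exactly 8 warps");
```

    With 1 thread per block, `lane_utilization(1)` is 1/32, so about 3% of each warp does useful work. With 32 or 256 threads per block the ratio is 1.0, and a block of 48 threads sits in between at 0.75 because its second warp is only half full.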

    Having lots of blocks with few threads is not what the GPU is designed for. In this case you hit the limit on the number of blocks per SM (this number depends on your device), which forces the GPU to spend a lot of time placing blocks on an SM and then removing them to process the next ones. You should increase the number of threads per block rather than the number of blocks in your application.
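    As a rough sketch of what that change looks like for the 30000 work items in the question, the host-side arithmetic below compares the two launch shapes. The block size of 256 is only an assumed starting point, not a tuned value, and `LaunchConfig`, `cover` and `total_warps` are hypothetical helpers rather than CUDA API functions.

```cpp
// Warp size assumed to be 32 threads.
constexpr int kWarpSize = 32;

// Hypothetical helper type: a 1D grid size and block size.
struct LaunchConfig {
    int blocks;
    int threads_per_block;
};

// Smallest grid that covers n work items with the given block size.
constexpr LaunchConfig cover(int n, int threads_per_block) {
    return LaunchConfig{(n + threads_per_block - 1) / threads_per_block,
                        threads_per_block};
}

// Total warps the GPU has to schedule for a launch.
constexpr int total_warps(LaunchConfig cfg) {
    return cfg.blocks * ((cfg.threads_per_block + kWarpSize - 1) / kWarpSize);
}

// In a kernel launched as my_kernel<<<cfg.blocks, cfg.threads_per_block>>>(...)
// (my_kernel is a placeholder name), each thread would compute
//   int i = blockIdx.x * blockDim.x + threadIdx.x;
// and guard its work with `if (i < n)`, because the last block may be
// only partly filled.
```

    Under these assumptions, `cover(30000, 1)` gives 30000 blocks and 30000 nearly empty warps, while `cover(30000, 256)` gives 118 blocks and 944 full warps for the same work. The best block size still depends on the device and the kernel, so it is worth measuring a few values.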

© 2021 The Archive Base. All Rights Reserved