I have CUDA 2.1 installed on my machine, which has a graphics card with 64 CUDA cores.
I have written a program that launches 30000 blocks simultaneously (with 1 thread per block), but I am not getting satisfying results from the GPU: it runs slower than the CPU.
Must the number of blocks be smaller than or equal to the number of cores for good performance? Or does performance have nothing to do with the number of blocks?
CUDA cores are not cores in the sense of a classical CPU core. Think of them as little more than ALUs (Arithmetic Logic Units), which only execute operations whose operands are already available.
As you may know, threads are executed in warps (groups of 32 threads) within the blocks you define. When your blocks are dispatched to the SMs (Streaming Multiprocessors, which are the closest thing to actual cores on the GPU), each SM switches between the warps of the blocks resident on it to hide memory latency: while one warp waits for its input data, another warp can compute.
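To see the numbers this discussion depends on for a particular card, the runtime API can report them. The following is a minimal sketch, assuming the device of interest is device 0; `queryDevice` is a hypothetical helper name, not part of CUDA, and very old toolkits may not expose every field printed here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: fills `prop` for the given device, returns false on error.
bool queryDevice(int device, cudaDeviceProp* prop) {
    cudaError_t err = cudaGetDeviceProperties(prop, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main() {
    cudaDeviceProp prop;
    if (!queryDevice(0, &prop)) {  // assumption: device 0 is the GPU in question
        return 1;
    }
    std::printf("Device:                %s\n", prop.name);
    std::printf("SMs:                   %d\n", prop.multiProcessorCount);
    std::printf("Warp size:             %d\n", prop.warpSize);
    std::printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    std::printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}
```

Compiled with `nvcc`, this prints the SM count and warp size, which are the figures that matter for choosing a launch configuration, rather than the raw CUDA core count.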
The problem is that threads are always executed as part of a warp. With only one thread per block, each block contains a single warp with just 1 of its 32 lanes active, so the SM has very few warps to switch between and you cannot take advantage of the multiple CUDA cores available. The cores end up waiting for data to process, since they compute far faster than data can be fetched from memory.
Launching lots of blocks with few threads each is not what the GPU is designed for. In this case you hit the limit on resident blocks per SM (the number depends on your device), which forces the GPU to spend much of its time placing blocks on the SMs and then retiring them to make room for the next ones. You should increase the number of threads per block rather than the number of blocks in your application.
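As an illustration of that advice, here is a minimal sketch contrasting the two launch configurations for 30000 elements. The kernel `scale`, the helper `blocksFor`, and the choice of 256 threads per block are assumptions made for the example (the original program is not shown); 256 is simply a common starting point that is a multiple of the warp size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread processes one element.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard: the last block may have threads past the end
        data[i] *= factor;
    }
}

// Number of blocks needed to cover n elements (ceiling division).
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main() {
    const int n = 30000;
    float* d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    // Configuration from the question: 30000 blocks of 1 thread.
    // Every block holds one warp with a single active lane.
    scale<<<n, 1>>>(d_data, n, 2.0f);

    // Suggested shape: far fewer blocks, each with several full warps.
    const int threadsPerBlock = 256;  // assumption: tune for your device
    scale<<<blocksFor(n, threadsPerBlock), threadsPerBlock>>>(d_data, n, 2.0f);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    }
    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```

With 256 threads per block, the same 30000 elements are covered by 118 blocks of 8 warps each, so every SM has several full warps to switch between while others wait on memory.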