I’m trying to reach peak performance of each SM from the code below. The

Question

0

Asked: May 27, 20262026-05-27T21:32:51+00:00 2026-05-27T21:32:51+00:00

I’m trying to reach peak performance of each SM from the code below. The

0

I’m trying to reach peak performance of each SM from the code below. The peak lies somewhere between 25 GFlops(GTX275-GT200 Arch.). This code gives 8 GFlops at the max.

__global__ void new_ker(float *x)
{
  int index = threadIdx.x+blockIdx.x*blockDim.x;
  float a,b;
  a=0;
  b=x[index];
  //LOOP=10000000
  //No. of blocks = 1
  //Threads per block = 512 (I'm using GTX 275 - GT200 Arch.)
  #pragma unroll 2048
  for(int i=0;i<LOOP;i++){
       a=a*b+b;
  }  

  x[index] = a;

 }

I don’t want to increase ILP in the code. Any ideas why it’s not reaching peak??

int main(int argc,char **argv)
{

   //Initializations
   float *x;
   float *dx;
   cudaEvent_t new_start,new_stop;
   float elapsed;
   double gflops;
   x = 0;
   flag = 0;
   cudaMalloc((void **)&dx,sizeof(float)*THPB);

   //ILP=1  
   cudaEventCreate(&new_start);
   cudaEventCreate(&new_stop);
   printf("Kernel1:\n");
   cudaEventRecord(new_start, 0);
   new_ker<<<BLOCKS,THPB>>>(dx);
   cudaEventRecord(new_stop,0);
   cudaEventSynchronize(new_stop);
   cudaEventElapsedTime(&elapsed,new_start,new_stop);
   x = (float *)malloc(sizeof(float)*THPB);
   cudaMemcpy(x,dx,sizeof(float)*THPB,cudaMemcpyDeviceToHost);

   gflops = ((double)(BLOCKS)*(THPB)*LOOP/elapsed)/1000000;
   printf("\t%f",gflops);
   cudaEventDestroy(new_start);
   cudaEventDestroy(new_stop);
   return 0;
}

Platform:
CUDA 3.0
NVIDIA GeForce GTX275 (GT200)

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-27T21:32:51+00:00

If I put together a complete repro case from your code, using the correct FLOP calculation:

#include <stdio.h> 

#define LOOP (10000000)
#define BLOCKS (30)
#define THPB (512)

__global__ void new_ker(float *x)
{
  int index = threadIdx.x+blockIdx.x*blockDim.x;
  float a,b;
  a=0;
  b=x[index];
  #pragma unroll 2048
  for(int i=0;i<LOOP;i++){
       a=a*b+b;
  }  

  x[index] = a;
}

int main(int argc,char **argv)
{

   //Initializations
   float *x;
   float *dx;
   cudaEvent_t new_start,new_stop;
   float elapsed;
   double gflops;
   x = 0;
   cudaMalloc((void **)&dx,sizeof(float)*THPB);

   //ILP=1  
   cudaEventCreate(&new_start);
   cudaEventCreate(&new_stop);
   printf("Kernel1:\n");
   cudaEventRecord(new_start, 0);
   new_ker<<<BLOCKS,THPB>>>(dx);
   cudaEventRecord(new_stop,0);
   cudaEventSynchronize(new_stop);
   cudaEventElapsedTime(&elapsed,new_start,new_stop);
   x = (float *)malloc(sizeof(float)*THPB*BLOCKS);
   cudaMemcpy(x,dx,sizeof(float)*THPB*BLOCKS,cudaMemcpyDeviceToHost);

   gflops = 2.0e-6 * ((double)(LOOP)*double(THPB*BLOCKS)/(double)elapsed);
   printf("\t%f\n",gflops);
   cudaEventDestroy(new_start);
   cudaEventDestroy(new_stop);
   return 0;
}

And I compile it and run it on a 1.4GHz GTX275 with CUDA 3.2 on a 64 bit linux platform:

$ nvcc -arch=sm_13 -Xptxas="-v" -o perf perf.cu
ptxas info    : Compiling entry function '_Z7new_kerPf' for 'sm_13'
ptxas info    : Used 4 registers, 8+16 bytes smem, 8 bytes cmem[1]
$ ./perf 
Kernel1:
        671.806039

I get within 0.01% of peak FLOP/s for that card running a pure FMAD code (1.4 GHz * 2 FLOP * 8 cores/MP * 30 MP) = 672 GFLOP/s.

So it seems that the code does, in fact, hit peak FLOP/s with one block per multiprocessor, but you just are not calculating the FLOP/s number correctly.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I’m trying to reach peak performance of each SM from the code below. The

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply