I was using a naive prime generated function. This code takes about, 5.25 seconds

Question

0

Asked: June 14, 20262026-06-14T22:27:03+00:00 2026-06-14T22:27:03+00:00

I was using a naive prime generated function. This code takes about, 5.25 seconds

0

I was using a naive prime generated function. This code takes about, 5.25 seconds to generate 10k prime numbers (device_primes[0] holds the number primes already found, the remaining position the prime numbers found).

_global__ void getPrimes(int *device_primes,int n)
{ 
    int c = 0;
    int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    int num = thread_id+2;

    if (thread_id == 0) device_primes[0] = 1;
    __syncthreads();

    while(device_primes[0] < n)
    {
        for (c = 2; c <= num - 1; c++)
        { 
            if (num % c == 0) //not prime
            {
                break;
            }
        }
        if (c == num) //prime
        {
            int pos = atomicAdd(&device_primes[0],1);
            device_primes[pos] = num;
        }
        num += blockDim.x * gridDim.x; // Next number for this thread       
    }
}

I was just starting to optimize the code, and i made the follow modification, instead of :

for (c = 2; c <= num - 1; c++)
{ 
    if (num % c == 0) //not prime
         break;
}
 if (c == num) {...}

i have now :

   int prime = 1;

   ...
   for (c = 2; c <= num - 1 && prime; c++)
    { 
        if (num % c == 0) prime = 0; // not prime
    }
     if (prime) {...} // if prime

Now i can generate 10k in 0.707s. I was just wondering why such speed up with a this simple modification, is break that bad?

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-14T22:27:04+00:00

As Tony suggested, divergent code execution can cause major slow downs in gpu code, forcing some code to run in serial rather than parallel. In the slow version of the code above, threads that hit the break diverge from code that continues.

The cuda c programming guide is a good resource for gpu programming techniques. Here is what it says about control flow:

Any flow control instruction (if, switch, do, for, while) can significantly impact the effective instruction throughput by causing threads of the same warp to diverge (i.e., to follow different execution paths). If this happens, the different executions paths have to be serialized, increasing the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path.

Newer nvidia hardware and cuda versions can handle some branching a little better than older versions, but it’s still best to avoid branching whenever possible.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I was using a naive prime generated function. This code takes about, 5.25 seconds

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply