I am new to CUDA. I am trying to parallelize the following code. Right now it runs inside a kernel but does not use multiple threads at all, so it is slow. I tried to use this answer, but to no avail so far.
The kernel is supposed to generate the first n prime numbers and put them into the device_primes array, which is later accessed from the host. The code is correct and works fine in its serial version, but I need to speed it up, perhaps with the use of shared memory.
//CUDA kernel code
__global__ void generatePrimes(int* device_primes, int n)
{
    //int i = blockIdx.x * blockDim.x + threadIdx.x;
    //int j = blockIdx.y * blockDim.y + threadIdx.y;
    int counter = 0;
    int c = 0;
    for (int num = 2; counter < n; num++)
    {
        for (c = 2; c <= num - 1; c++)
        {
            if (num % c == 0) //not prime
            {
                break;
            }
        }
        if (c == num) //prime
        {
            device_primes[counter] = num;
            counter++;
        }
    }
}
My current, preliminary, and definitely wrong attempt to parallelize this looks like the following:
//CUDA kernel code
__global__ void generatePrimes(int* device_primes, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int num = i + 2;
    int c = j + 2;
    int counter = 0;
    if ((counter >= n) || (c > num - 1))
    {
        return;
    }
    if (num % c == 0) //not prime
    {
    }
    if (c == num) //prime
    {
        device_primes[counter] = num;
        counter++;
    }
    num++;
    c++;
}
But this code populates the array with data that does not make sense. In addition, many values are zeroes. Thanks in advance for any help, it’s appreciated.
You have some problems in your code, for example:
The expression int num = i + 2; assigns iteration 2 to thread 0, iteration 3 to thread 1, and so on. The problem is that the next iteration each thread computes is determined by num++;. Consequently, thread 0 will next compute iteration 3, which was already computed by thread 1, leading to redundant computation. Furthermore, I think for this problem it would be easier to use only one dimension instead of two (x, y). With this in mind, you have to change num++ so that each thread skips ahead by the total number of threads, for example num += blockDim.x * gridDim.x;.
Another issue is that you did not take into consideration that the variable counter has to be shared among threads. Otherwise, each thread will try to find n primes, and all of them will populate the entire array. So you have to change int counter = 0; to a shared or global variable. Let us use a global variable so that it is visible to all the threads from all the blocks. We can use position zero of the array device_primes to hold the variable counter.
You also have to initialize this value. Let us assign this job to only one thread, namely the thread with id = 0, for example if (i == 0) device_primes[0] = 1;.
However, this variable is global and will be written by all threads. Therefore, we must guarantee that all threads, before writing to that global variable, see that counter is 1 (the first position of device_primes that holds a prime; position zero is for the counter), so you also have to add a barrier after the initialization. Keep in mind that __syncthreads() only synchronizes the threads of one block, so with more than one block the counter would have to be initialized before the kernel is launched, for example from the host.
A possible solution (albeit an inefficient one):
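As a minimal sketch of such a kernel (not a definitive implementation), the steps above could be combined as follows. It assumes device_primes has n + 1 elements, with slot 0 used as the counter and slots 1..n receiving the primes. The host helper primesOnGpu and the launch configuration of 4 blocks of 128 threads are hypothetical choices for illustration. The counter is initialized from the host rather than by thread 0, because __syncthreads() does not synchronize across blocks.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Slot 0 of device_primes holds the next free index (starts at 1);
// slots 1..n receive the primes, in whatever order threads find them.
__global__ void generatePrimes(int* device_primes, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;     // total number of threads
    volatile int* counter = device_primes;   // force a fresh read on every pass

    for (int num = i + 2; counter[0] <= n; num += stride)
    {
        int c;
        for (c = 2; c <= num - 1; c++)
        {
            if (num % c == 0) break;         // not prime
        }
        if (c == num)                        // prime
        {
            int pos = atomicAdd(&device_primes[0], 1);  // reserve a slot
            if (pos <= n) device_primes[pos] = num;     // ignore overshoot
        }
    }
}

// Hypothetical host helper: returns the n primes written by the kernel.
std::vector<int> primesOnGpu(int n)
{
    std::vector<int> host(n + 1, 0);
    host[0] = 1;                             // counter starts at slot 1
    int* device_primes = nullptr;
    size_t bytes = (n + 1) * sizeof(int);
    cudaMalloc(&device_primes, bytes);
    cudaMemcpy(device_primes, host.data(), bytes, cudaMemcpyHostToDevice);
    generatePrimes<<<4, 128>>>(device_primes, n);   // assumed launch configuration
    cudaMemcpy(host.data(), device_primes, bytes, cudaMemcpyDeviceToHost);
    cudaFree(device_primes);
    return std::vector<int>(host.begin() + 1, host.end());
}
```

Because threads reserve slots in the order they finish, the result is not sorted, and it is not guaranteed to contain exactly the n smallest primes: a thread testing a larger number may reserve a slot before a thread testing a smaller one. Sorting on the host, or over-generating and trimming, would be needed if the first n primes are required in order.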
The line atomicAdd(&device_primes[0], 1); basically performs device_primes[0]++;. We are using an atomic operation because the variable counter is global and we need to guarantee mutual exclusion. Note that on older toolkits you may have to compile with the flag -arch sm_20; recent CUDA releases no longer support sm_20, and any architecture they do support provides this atomicAdd.
Optimization:
Code-wise, an approach with less or no synchronization would be better. Moreover, the number of computations could be reduced by taking into account some properties of prime numbers, as shown in http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes.
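One way this could look is a sieve-style sketch that needs no atomics and no barrier. The names markComposites and firstPrimesSieve are hypothetical. The size of the sieve is an assumption based on the known bound p_n < n(ln n + ln ln n) for n >= 6. Each thread takes one base p up to the square root of the limit and marks its multiples. Several threads may write the same flag, but they all write the same value, which this sketch relies on.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Each thread takes one base p >= 2 and marks p*p, p*p + p, ... as composite.
// Composite bases do redundant work, but the result is still correct.
__global__ void markComposites(unsigned char* is_composite, int limit)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x + 2;
    if ((long long)p * p > limit) return;
    for (long long m = (long long)p * p; m <= limit; m += p)
        is_composite[m] = 1;
}

// Hypothetical host helper: returns the first n primes in increasing order.
std::vector<int> firstPrimesSieve(int n)
{
    if (n <= 0) return {};
    // Upper bound on the n-th prime (valid for n >= 6); 13 covers n < 6.
    int limit = (n < 6) ? 13
                        : (int)std::ceil(n * (std::log((double)n) +
                                              std::log(std::log((double)n))));
    unsigned char* d_flags = nullptr;
    cudaMalloc(&d_flags, limit + 1);
    cudaMemset(d_flags, 0, limit + 1);

    int bases = (int)std::sqrt((double)limit) + 1;  // enough threads for 2..sqrt(limit)
    int threads = 128;                              // assumed block size
    int blocks = (bases + threads - 1) / threads;
    markComposites<<<blocks, threads>>>(d_flags, limit);

    std::vector<unsigned char> flags(limit + 1);
    cudaMemcpy(flags.data(), d_flags, limit + 1, cudaMemcpyDeviceToHost);
    cudaFree(d_flags);

    // Collect the unmarked numbers on the host, smallest first.
    std::vector<int> primes;
    for (int v = 2; v <= limit && (int)primes.size() < n; ++v)
        if (!flags[v]) primes.push_back(v);
    return primes;
}
```

Unlike the atomic-counter version, this returns the primes sorted, at the cost of a flag array whose size grows with the bound on the n-th prime.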