This is my code. I have a lot of threads, and each of them calls this function many times.
Inside the function I create an array. Is this an efficient implementation? If not, please suggest a more efficient one.
__device__ float calculate_minimum(float *arr)
{
    float vals[9]; // this array is declared on every call to the function
    // Is it efficient? Or how can I implement this efficiently?
    // Do I need to deallocate the memory after using this array?
    for (int i = 0; i < 9; i++)
        vals[i] = /* call some function and assign the value */;
    float min = findMin(vals);
    return min;
}
There is no “array creation” in that code. There is a statically declared array. Further, the standard CUDA compilation model will inline expand
__device__ functions, meaning that vals will be compiled to be in local memory or, if possible, even in registers. All of this happens at compile time, not run time, so there is no per-call allocation cost and nothing to deallocate.
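To make the point concrete, here is a minimal sketch of the same pattern as a complete program. It rests on several assumptions that are not in the question: sample_value stands in for the unspecified "some function", find_min is a guessed implementation of findMin, and min_kernel, min_of_groups, and the layout of 9 consecutive floats per thread are purely illustrative. Fully unrolling the fixed-length loops gives the compiler the constant indices it generally needs to keep vals in registers, but whether it does so is up to the compiler. Building with nvcc -Xptxas -v reports register and local-memory usage if you want to check.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the "some function" left unspecified in the question.
__device__ float sample_value(const float *arr, int i)
{
    return arr[i];
}

// Assumed implementation of findMin: a linear scan over a fixed-size array.
__device__ float find_min(const float (&vals)[9])
{
    float m = vals[0];
    #pragma unroll
    for (int i = 1; i < 9; i++)
        m = fminf(m, vals[i]);
    return m;
}

__device__ float calculate_minimum(const float *arr)
{
    float vals[9];  // per-thread automatic storage: no allocation call, nothing to free
    #pragma unroll
    for (int i = 0; i < 9; i++)
        vals[i] = sample_value(arr, i);
    return find_min(vals);
}

// Illustrative kernel: thread tid reduces the 9 floats starting at in[9 * tid].
__global__ void min_kernel(const float *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = calculate_minimum(in + 9 * tid);
}

// Illustrative host helper: copies n groups of 9 floats to the GPU, runs the
// kernel, and copies the n minima back. Returns false on any CUDA error.
bool min_of_groups(const float *h_in, float *h_out, int n)
{
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, 9 * n * sizeof(float)) != cudaSuccess)
        return false;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaError_t err = cudaMemcpy(d_in, h_in, 9 * n * sizeof(float),
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        min_kernel<<<blocks, threads>>>(d_in, d_out, n);
        err = cudaGetLastError();  // catches launch-configuration errors
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_out, n * sizeof(float),
                         cudaMemcpyDeviceToHost);  // also waits for the kernel
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess;
}

int main()
{
    const int n = 2;
    float h_in[9 * n] = { 5, 3, 8, 1, 9, 2, 7, 4, 6,
                          10, 12, 11, 15, 13, 14, 18, 16, 17 };
    float h_out[n] = { 0 };
    if (!min_of_groups(h_in, h_out, n)) {
        fprintf(stderr, "CUDA error\n");
        return 1;
    }
    for (int i = 0; i < n; i++)
        printf("group %d: min = %g\n", i, h_out[i]);
    return 0;
}
```

Note that the device code contains no malloc, new, or free: the array's storage is fixed when the kernel is compiled, which is why calling the function many times from many threads adds no allocation overhead.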