I’m building a toolkit that offers different algorithms in CUDA. However, many of these algorithms use static constant global data that will be used by all threads, declared this way for example:
static __device__ __constant__ real buf[MAX_NB];
My problem is that if I include all the .cuh files in the library, all of this memory will be allocated on the device when the library is instantiated, even though the user might want to use only one of these algorithms. Is there any way around this? Will I absolutely have to use the typical dynamically allocated memory?
I want the fastest constant memory possible that can be used by all threads at runtime. Any ideas?
Thanks!
All the constant memory declared in a .cu file is allocated at launch, when its module is loaded (each .cu file is compiled to its own .cubin, and each .cubin is a separate module). Therefore, to use many different kernels that use constant memory, you have to split them across separate .cu files so that no single module overflows its constant memory. The usual maximum is 64 KB. Source: http://forums.nvidia.com/index.php?showtopic=185993
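A minimal sketch of what one such per-algorithm .cu file could look like. The file name `algo_a.cu`, the `real` typedef, the value of `MAX_NB`, and the names `scale_by_buf`, `algo_a_run`, and `device_constant_bytes` are all made up for illustration; only the `buf` declaration comes from the question. The constant buffer stays private to this file, so it only counts against this module's constant memory.

```cuda
// algo_a.cu: hypothetical file holding one algorithm and its constant data
#include <cstddef>
#include <cuda_runtime.h>

typedef float real;   // assumption: the question does not define `real`
#define MAX_NB 1024   // assumption: 1024 * 4 bytes = 4 KB, well under 64 KB

// Same declaration style as in the question; visible only inside this file.
static __device__ __constant__ real buf[MAX_NB];

// Illustrative kernel: every thread reads coefficients from constant memory.
__global__ void scale_by_buf(const real* in, real* out, int n, int nb)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * buf[i % nb];
}

// Reports how much constant memory the device offers (commonly 65536 bytes).
// Returns 0 if the device properties cannot be queried.
size_t device_constant_bytes(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return 0;
    return prop.totalConstMem;
}

// Host entry point: uploads nb coefficients into buf, runs the kernel on
// n input values, and copies the result back. Returns false on any error.
bool algo_a_run(const real* coeffs, int nb, const real* in, real* out, int n)
{
    if (nb <= 0 || nb > MAX_NB || n <= 0) return false;

    // Fill this module's constant buffer from host memory.
    if (cudaMemcpyToSymbol(buf, coeffs, nb * sizeof(real)) != cudaSuccess)
        return false;

    real* d_in = nullptr;
    real* d_out = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(real);
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }

    bool ok = cudaMemcpy(d_in, in, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        scale_by_buf<<<blocks, threads>>>(d_in, d_out, n, nb);
        // This blocking copy also surfaces errors from the kernel launch.
        ok = cudaMemcpy(out, d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

A second algorithm would get its own file (say `algo_b.cu`) with its own `static` constant buffer, and the library header would expose only the host entry points. Note that this relies on each file being compiled on its own; as far as I know, building with relocatable device code (`-rdc=true`) links the objects into one device image, so the constant buffers would then share one budget again.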