My kernel uses registers extensively.
When compiling for compute capability 1.2 devices, --ptxas-options=-v reports 83 registers. When I compile for 2.0, only 63 registers are used and the rest of the per-thread data is placed in local memory. Experiments with --maxrregcount show a limit of 124 registers per thread for 1.2 devices, but only 63 for 2.0.
Is it possible to keep all of the data in registers on the 2.0 architecture?
Unfortunately, the limit for compute capability 2.x cards is 63 registers per thread. There isn't any way to stop spilling to local memory if you have a very complex kernel that consumes a lot of registers.
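
If it helps to confirm how much a given build actually spills, the same numbers that --ptxas-options=-v prints can also be read at run time with cudaFuncGetAttributes. The following is only a minimal sketch: heavyKernel is a hypothetical stand-in for the real kernel (it is far too simple to spill), and the values reported depend on the architecture you compile for.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; substitute the real register-heavy kernel.
__global__ void heavyKernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] * 2.0f;
    }
}

int main()
{
    // Ask the runtime for the compiled kernel's resource usage.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, heavyKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // numRegs: registers used per thread.
    // localSizeBytes: local memory per thread; non-zero usually means spilling
    // (or locally declared arrays that could not be kept in registers).
    printf("registers per thread:    %d\n", attr.numRegs);
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    return 0;
}
```

Building this once for each target architecture and comparing the local memory figure gives a rough idea of how much data the 63-register limit pushes out of registers.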