I have written CUDA code to solve an NP-complete problem, but the performance was not what I expected.
I know about “some” optimization techniques (using shared memory, textures, zero-copy…).
What are the most important optimization techniques CUDA programmers should know about?
You should read NVIDIA’s CUDA Best Practices Guide: http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_CUDA_BestPracticesGuide.pdf
It lists many performance tips, each with an associated “priority”. Here are some of the top-priority tips: