I suspect I have a fine-grained memory error in a large CUDA kernel I’m running. Device-side printf shows varying values in variables that should be deterministic. The “stable” version of the CUDA development tools I’m using removed device emulation mode, and its version of cuda-gdb doesn’t work with templated functions. cuda-memcheck runs, but doesn’t catch anything.
On the CPU I would use Valgrind or Electric Fence to catch memory errors like this. What neat tricks are there for debugging memory errors if all you have available is printf?
For example, is there a way to flood the entire memory space with NaNs and use printfs to find where they first pop up in my computations?
For this sort of thing, I like to allocate the whole available global memory space, then manage the memory myself. Use a custom memset function to set the whole allocation to a recognisable word-sized bit pattern, then initialize blocks within the allocation for your kernel to use. If you implement a simple device-side assert that traps that bit pattern and reports the thread, block, and line where it shows up, you should be able to isolate out-of-bounds global memory reads that cuda-memcheck doesn’t catch.
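Here is a minimal sketch of that approach, not a drop-in tool. Everything named here is hypothetical and chosen for illustration: `POISON_WORD`, `CHECK_POISON`, `poisonFill`, `poolAlloc`, and the deliberately buggy `buggyDiff` kernel. It assumes a device that supports device-side printf, and it grabs roughly 90% of free memory rather than literally all of it. The poison value is picked so that it is also a quiet NaN when read as a float, which covers the NaN-flooding idea from the question.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical sentinel: a quiet NaN when viewed as a float, easy to spot in hex.
#define POISON_WORD 0x7FC0DEADu

__device__ unsigned int g_poisonHits = 0;  // number of poisoned reads seen

// Hypothetical device-side "assert": report where a poisoned word was loaded.
#define CHECK_POISON(f)                                              \
    do {                                                             \
        if (__float_as_uint(f) == POISON_WORD) {                     \
            atomicAdd(&g_poisonHits, 1u);                            \
            printf("poison read: block %u thread %u at %s:%d\n",     \
                   blockIdx.x, threadIdx.x, __FILE__, __LINE__);     \
        }                                                            \
    } while (0)

// Custom memset: stamp every 32-bit word of the pool (grid-stride loop).
__global__ void poisonFill(unsigned int *pool, size_t nWords) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < nWords; i += stride)
        pool[i] = POISON_WORD;
}

// Deliberately buggy kernel: the last thread reads one element past the end.
__global__ void buggyDiff(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float a = in[i];     CHECK_POISON(a);
    float b = in[i + 1]; CHECK_POISON(b);  // out of bounds when i == n - 1
    out[i] = b - a;
}

// Host-side bump allocator over the single big allocation.
struct Pool { char *base; size_t size; size_t used; };

void *poolAlloc(Pool &p, size_t bytes) {
    // Round up to 256 bytes and leave a 256-byte poisoned guard after each block.
    size_t reserved = (bytes + 255) / 256 * 256 + 256;
    if (p.used + reserved > p.size) return nullptr;
    void *ptr = p.base + p.used;
    p.used += reserved;
    return ptr;
}

unsigned int runDemo() {
    unsigned int hits = 0;
    cudaMemcpyToSymbol(g_poisonHits, &hits, sizeof(hits));  // reset the counter

    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);
    Pool pool = {nullptr, freeB / 10 * 9, 0};  // assumption: keep ~10% headroom
    while (pool.size >= (1u << 20) &&
           cudaMalloc((void **)&pool.base, pool.size) != cudaSuccess) {
        cudaGetLastError();  // clear the failed-allocation error, then retry smaller
        pool.base = nullptr;
        pool.size /= 2;
    }
    if (!pool.base) return 0;

    poisonFill<<<1024, 256>>>((unsigned int *)pool.base, pool.size / sizeof(unsigned int));

    const int n = 1000;
    float *in  = (float *)poolAlloc(pool, n * sizeof(float));
    float *out = (float *)poolAlloc(pool, n * sizeof(float));
    std::vector<float> host(n, 1.0f);
    cudaMemcpy(in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    buggyDiff<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();  // also flushes the device-side printf buffer

    cudaMemcpyFromSymbol(&hits, g_poisonHits, sizeof(hits));
    cudaFree(pool.base);
    return hits;
}

int main() {
    printf("poisoned reads detected: %u\n", runDemo());
    return 0;
}
```

In this sketch only the read of `in[n]` lands in poisoned padding, so a single hit from the last active thread is what you would expect to see. Two caveats: the check only traps reads, so stray writes would need a separate pass that scans the guard regions for words that no longer hold the pattern, and a legitimate value that happens to equal the pattern produces a false positive.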