While looking at the names of the performance counters for the NVIDIA Fermi architecture (in the file Compute_profiler.txt in the doc folder of CUDA), I noticed that there are two performance counters for L2 cache misses, l2_subp0_read_sector_misses and l2_subp1_read_sector_misses. The file says that these are for the two slices of L2.
Why does L2 have two slices? Is there any relation to the streaming multiprocessor architecture? What effect does this division have on performance?
Thanks
I don’t think there is any direct relation to the streaming multiprocessors.
I think a slice is simply the equivalent of a memory bank.
Just sum the values of the two to get the “total” L2 read misses.
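As a minimal sketch of that summation, assuming you have already copied the two counter values out of the profiler output into a dictionary: the helper name `total_l2_read_sector_misses` and the numbers below are made up for illustration, only the two counter names come from Compute_profiler.txt.

```python
# Counter names as listed in Compute_profiler.txt, one per L2 slice.
SLICE_COUNTERS = (
    "l2_subp0_read_sector_misses",
    "l2_subp1_read_sector_misses",
)


def total_l2_read_sector_misses(counters):
    """Return the sum of the per-slice L2 read sector miss counters.

    `counters` is a hypothetical dict mapping counter name -> value,
    filled in by hand or by your own log parser.
    """
    return sum(counters[name] for name in SLICE_COUNTERS)


# Made-up values, for illustration only (not real measurements).
example_counters = {
    "l2_subp0_read_sector_misses": 1200,
    "l2_subp1_read_sector_misses": 1350,
}

total = total_l2_read_sector_misses(example_counters)
print("total L2 read sector misses:", total)
```

If one of the two counters is missing from the dictionary, this raises a KeyError rather than silently reporting only half of the misses.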