My kernel threads access a linear character array in a coalesced fashion. If I


Asked: May 30, 2026 (2026-05-30T07:33:30+00:00)


My kernel threads access a linear character array in a coalesced fashion. If I
bind the array to a texture, I don’t see any speedup; the running times are
almost the same. I’m working on a Tesla C2050 with compute capability 2.0 and read
somewhere that global memory accesses are cached. Is that true? Perhaps that is
why I am not seeing a difference in the running time.

The array in the main program is

char *dev_database = NULL;
cudaMalloc( (void**) &dev_database, JOBS * FRAGMENTSIZE * sizeof(char) );

and I bind it to the texture reference texture<char> texdatabase with

cudaBindTexture(NULL, texdatabase, dev_database, JOBS * FRAGMENTSIZE * sizeof(char) );

Each thread then reads a character ch = tex1Dfetch(texdatabase, p + id) where id
is threadIdx.x + blockIdx.x * blockDim.x and p is an offset.

I’m binding only once and dev_database is a large array. Actually I found that
if the size is too large the bind fails. Is there a limit on the size of the array
to bind? Thanks very much.

1 Answer

Editorial Team · Answer 1 · 2026-05-30T07:33:32+00:00

There are several possibilities for why you don’t see any difference in performance, but the most likely is that this memory access is not your bottleneck. If it is not your bottleneck, making it faster will have no effect on performance.

Regarding caching: yes, on compute capability 2.0 global memory loads are cached in L1 and L2. In this case each thread reads a single byte, so each warp reads 32 bytes, and four consecutive warps map to the same 128-byte cache line (128 / 32 = 4). Assuming few cache conflicts, you get up to 4x reuse from the cache. So even if this memory access is a bottleneck, it is conceivable that the texture cache would not help you more than the general-purpose cache already does.

You should first determine whether you are bandwidth bound and whether this data access is the culprit. Only once you have confirmed that should you optimize your memory accesses. Another tactic to consider is to access 4 to 16 chars per thread per load (using a char4 or int4 struct with byte packing/unpacking) rather than one per thread. Wider loads put more bytes in flight at a time, which can help to saturate the global memory bus.
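To make both suggestions concrete, here is a minimal sketch that times a one-byte-per-thread kernel against a char4-per-thread kernel and prints the effective bandwidth of each. Everything in it is an assumption for illustration rather than the original poster's code: the kernel names perChar and perChar4, the placeholder per-byte operation process, the 64 MiB buffer size, and the launch configuration. It also assumes the byte count is a multiple of 4 and relies on cudaMalloc returning suitably aligned memory for the char4 cast.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-byte work; stands in for whatever the real kernel does.
__device__ __forceinline__ char process(char c) { return (char)(c ^ 0x20); }

// Baseline: one byte per thread per load (a warp requests 32 bytes).
__global__ void perChar(const char *in, char *out, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = threadIdx.x + (size_t)blockIdx.x * blockDim.x; i < n; i += stride)
        out[i] = process(in[i]);
}

// Packed: one char4 per thread per load (a warp requests 128 bytes).
__global__ void perChar4(const char4 *in, char4 *out, size_t n4)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = threadIdx.x + (size_t)blockIdx.x * blockDim.x; i < n4; i += stride) {
        char4 v = in[i];          // single 4-byte load
        v.x = process(v.x);       // unpack, process each byte, repack
        v.y = process(v.y);
        v.z = process(v.z);
        v.w = process(v.w);
        out[i] = v;               // single 4-byte store
    }
}

// Waits for the stop event and returns the elapsed time in milliseconds.
static float elapsedMs(cudaEvent_t start, cudaEvent_t stop)
{
    float ms = 0.0f;
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    return ms;
}

int main()
{
    const size_t n = 64u << 20;              // assumed size: 64 MiB, multiple of 4
    const int threads = 256, blocks = 1024;  // grid-stride loops cover the rest

    char *in = NULL, *out = NULL;
    if (cudaMalloc((void**)&in, n) != cudaSuccess ||
        cudaMalloc((void**)&out, n) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(in, 'a', n);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    perChar<<<blocks, threads>>>(in, out, n);            // warm-up run
    cudaEventRecord(start);
    perChar<<<blocks, threads>>>(in, out, n);            // timed run
    cudaEventRecord(stop);
    float msChar = elapsedMs(start, stop);

    perChar4<<<blocks, threads>>>((const char4*)in, (char4*)out, n / 4);  // warm-up run
    cudaEventRecord(start);
    perChar4<<<blocks, threads>>>((const char4*)in, (char4*)out, n / 4);  // timed run
    cudaEventRecord(stop);
    float msChar4 = elapsedMs(start, stop);

    // Each kernel reads n bytes and writes n bytes.
    double gigabytes = 2.0 * (double)n / 1e9;
    printf("char : %.3f ms, %.1f GB/s\n", msChar, gigabytes / (msChar * 1e-3));
    printf("char4: %.3f ms, %.1f GB/s\n", msChar4, gigabytes / (msChar4 * 1e-3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

If the effective bandwidth of your real kernel is already close to the card's peak memory bandwidth, you are likely bandwidth bound and wider loads are worth trying. If it is far below peak, the bottleneck is probably elsewhere, and neither the texture path nor wider loads should be expected to change the running time much.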

There is a good presentation by Paulius Micikevicius from GTC 2010 that you might want to watch. It covers both analysis-driven optimization and the specific concept of memory transactions in flight.
