As a quick backdrop for my question, with x86, it is guaranteed that a

Question

Asked: June 6, 2026 (2026-06-06T16:48:11+00:00)

As a quick backdrop for my question: on x86, it is guaranteed that an individual memory access that is 4-byte aligned for a 32-bit word, or 8-byte aligned for a 64-bit word, will be atomic. Thus you can create “benign data races”, where at least one thread writes to a memory address while another thread reads from the same address, and the reader will not see the results of an incomplete write. Either the reading thread will see the entire effect of the write or it won’t see it at all.

What are the requirements in the CUDA programming model to create these types of “benign” data races? For instance, if two separate threads write a 64-bit value to the same global memory address from two separate but concurrently running blocks on two different SMs, will each atomically write its entire 64-bit value, so that a third observer only ever reads back a fully updated 64-bit memory block? Or would the writes take place at a smaller granularity, so that a third observer could see a partial write if it read the memory address while the two threads were writing to it?

I understand that race conditions are normally something to avoid, but if the requirements for memory ordering are relaxed, then there is no need to explicitly use atomic read/write functions. That being said, this is predicated on what the atomicity of an individual read/write is (i.e., how many bits, and at what alignment). Does anyone know where I can find this information?

1 Answer

Editorial Team · Answer 1 · 2026-06-06T16:48:13+00:00

Update: @Heatsink has kindly notified me that it is indeed possible to force some memory coherency by using the __threadfence() function.
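As a rough illustration of how `__threadfence()` is typically used, here is a minimal sketch modeled on the "last block done" pattern. The kernel name `publishAndCollect`, the counter `blocksDone`, and the buffer names are hypothetical, and the sketch assumes a grid of 8 blocks. Each block stores a 64-bit value, issues a fence, and then takes a ticket from an atomic counter; only the block that draws the last ticket reads the values back.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical counter: how many blocks have published their value so far.
__device__ unsigned int blocksDone = 0;

__global__ void publishAndCollect(volatile unsigned long long *partial,
                                  unsigned long long *total)
{
    __shared__ bool isLast;

    if (threadIdx.x == 0) {
        // Plain (non-atomic) 64-bit store of this block's value.
        partial[blockIdx.x] = blockIdx.x + 1ULL;

        // Order the store above before the counter update below,
        // as observed by other threads on the device.
        __threadfence();

        // atomicInc returns the old value and wraps back to 0 at gridDim.x.
        unsigned int ticket = atomicInc(&blocksDone, gridDim.x);
        isLast = (ticket == gridDim.x - 1);
    }
    __syncthreads();

    // Only the last block to take a ticket reads the published values.
    if (isLast && threadIdx.x == 0) {
        unsigned long long sum = 0;
        for (unsigned int b = 0; b < gridDim.x; ++b)
            sum += partial[b];   // volatile: load from memory, not a stale cached copy
        *total = sum;
        blocksDone = 0;          // reset so the kernel can be launched again
    }
}

int main()
{
    const int numBlocks = 8;     // assumed grid size for this sketch
    unsigned long long *dPartial, *dTotal, hTotal = 0;
    cudaMalloc(&dPartial, numBlocks * sizeof(unsigned long long));
    cudaMalloc(&dTotal, sizeof(unsigned long long));

    publishAndCollect<<<numBlocks, 32>>>(dPartial, dTotal);
    cudaMemcpy(&hTotal, dTotal, sizeof(hTotal), cudaMemcpyDeviceToHost);

    printf("sum of published values: %llu\n", hTotal);
    cudaFree(dPartial);
    cudaFree(dTotal);
    return 0;
}
```

Note that the fence alone does not synchronize anything. The atomic counter decides who reads, and the fence only orders the earlier store ahead of that counter update.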

—

Unless atomic functions are used, CUDA specifically does not guarantee any coherency when accessing global memory that has been updated by any thread scheduled in the same kernel call. It is only safe to read memory that was written by a previous kernel or memory copy.
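For the two-writers scenario in the question, the explicitly atomic route might look like the sketch below. The kernel name `racingWriters` and the two bit patterns are made up for illustration. The thread in each of two blocks writes a distinct 64-bit pattern to the same address with `atomicExch`, and the host then checks that the surviving value is one pattern or the other. This shows only the atomic-function approach; it says nothing about whether plain stores would behave the same way.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative patterns; any two distinguishable 64-bit values would do.
#define PATTERN_A 0x1111111111111111ULL
#define PATTERN_B 0x2222222222222222ULL

__global__ void racingWriters(unsigned long long *slot)
{
    if (threadIdx.x == 0) {
        unsigned long long pattern = (blockIdx.x == 0) ? PATTERN_A : PATTERN_B;
        // 64-bit atomic exchange: the whole word is replaced in one operation.
        atomicExch(slot, pattern);
    }
}

int main()
{
    unsigned long long *dSlot, hSlot = 0;
    cudaMalloc(&dSlot, sizeof(hSlot));
    cudaMemset(dSlot, 0, sizeof(hSlot));

    racingWriters<<<2, 32>>>(dSlot);   // two blocks race on one address
    cudaMemcpy(&hSlot, dSlot, sizeof(hSlot), cudaMemcpyDeviceToHost);

    bool whole = (hSlot == PATTERN_A) || (hSlot == PATTERN_B);
    printf("final value 0x%016llx is %s\n", hSlot,
           whole ? "one complete write" : "a mix of both writes");

    cudaFree(dSlot);
    return 0;
}
```

A reader running in the same kernel could obtain the current value atomically with something like `atomicAdd(slot, 0ULL)`, which returns the old contents without changing them.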

So, not only can you not assume anything about memory access patterns; you can’t even know when an update made to global memory by one thread will become visible to another thread, or indeed, if it will become visible at all.

Of course, given the way the hardware is implemented in a given architecture, you may be able to find a way to implement some type of non-blocking synchronization between threads. However, I sincerely doubt that it would be possible to do that safely between blocks. What the threads in one block see will depend on which SM the block runs on, which blocks have run before, and where the updates made by those blocks currently sit in the cache hierarchy.

When considering threads within a block, the discussion is moot, as threads in a block can communicate with shared memory, the behavior of which is carefully specified by CUDA.
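For completeness, a minimal sketch of that within-block case is given here, assuming a block size of 64 and a hypothetical kernel named `rotateWithinBlock`. Each thread writes one element to shared memory, and after `__syncthreads()` every thread can safely read an element written by a different thread in the same block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 64   // assumed block size for this sketch

__global__ void rotateWithinBlock(const int *in, int *out)
{
    __shared__ int tile[BLOCK];
    unsigned int t = threadIdx.x;
    unsigned int base = blockIdx.x * blockDim.x;

    tile[t] = in[base + t];   // each thread writes its own slot
    __syncthreads();          // all writes to tile are visible block-wide after this

    // Read the slot written by the neighbouring thread (wrapping around).
    out[base + t] = tile[(t + 1) % BLOCK];
}

int main()
{
    int hIn[BLOCK], hOut[BLOCK];
    for (int i = 0; i < BLOCK; ++i) hIn[i] = i;

    int *dIn, *dOut;
    cudaMalloc(&dIn, sizeof(hIn));
    cudaMalloc(&dOut, sizeof(hOut));
    cudaMemcpy(dIn, hIn, sizeof(hIn), cudaMemcpyHostToDevice);

    rotateWithinBlock<<<1, BLOCK>>>(dIn, dOut);
    cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);

    printf("out[0] = %d, out[%d] = %d\n", hOut[0], BLOCK - 1, hOut[BLOCK - 1]);
    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```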
