So if I want to do a Unified Virtual Addressing (UVA) copy between two GPUs (using cudaMemcpyAsync with the kind argument set to cudaMemcpyDefault), whose stream should I use? A stream of the device that holds the source memory, or a stream of the device that holds the destination memory?
Thank you.
Suggestion: use cudaMemcpyPeerAsync instead. See this question as an example.
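A minimal sketch of what that could look like, assuming two GPUs with device IDs 0 and 1. The buffer size, the CHECK macro, and the choice to create the stream on the source device are my own illustration, not something the API requires:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    int deviceCount = 0;
    CHECK(cudaGetDeviceCount(&deviceCount));
    if (deviceCount < 2) {
        printf("This sketch needs at least two GPUs.\n");
        return 0;
    }

    const int srcDev = 0, dstDev = 1;   // assumed device IDs
    const size_t bytes = 1 << 20;       // arbitrary 1 MiB buffer
    void *src = nullptr, *dst = nullptr;

    // Allocate the destination buffer on the destination device.
    CHECK(cudaSetDevice(dstDev));
    CHECK(cudaMalloc(&dst, bytes));

    // Switch to the source device; the stream created now belongs to it.
    CHECK(cudaSetDevice(srcDev));
    CHECK(cudaMalloc(&src, bytes));
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Fill the source, then copy GPU to GPU, both ordered on the same stream.
    CHECK(cudaMemsetAsync(src, 0x5A, bytes, stream));
    CHECK(cudaMemcpyPeerAsync(dst, dstDev, src, srcDev, bytes, stream));
    CHECK(cudaStreamSynchronize(stream));
    printf("peer copy of %zu bytes finished\n", bytes);

    // Clean up while srcDev is still the current device.
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(src));
    CHECK(cudaSetDevice(dstDev));
    CHECK(cudaFree(dst));
    return 0;
}
```

The nice thing about cudaMemcpyPeerAsync is that both device IDs are spelled out in the call, so the only remaining choice is the stream.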
I guess to answer your question, from here:
So choose a stream that belongs to the device you selected with your most recent cudaSetDevice() call.
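Here is a rough sketch of that rule applied to the UVA copy from the question. It assumes device IDs 0 and 1 and makes the source device current before creating the stream; the CHECK macro and buffer size are made up for the example:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    int deviceCount = 0;
    CHECK(cudaGetDeviceCount(&deviceCount));
    if (deviceCount < 2) {
        printf("This sketch needs at least two GPUs.\n");
        return 0;
    }

    const int srcDev = 0, dstDev = 1;   // assumed device IDs
    const size_t bytes = 1 << 20;       // arbitrary 1 MiB buffer

    // cudaMemcpyDefault relies on UVA, so check that both devices report it.
    int uvaSrc = 0, uvaDst = 0;
    CHECK(cudaDeviceGetAttribute(&uvaSrc, cudaDevAttrUnifiedAddressing, srcDev));
    CHECK(cudaDeviceGetAttribute(&uvaDst, cudaDevAttrUnifiedAddressing, dstDev));
    if (!uvaSrc || !uvaDst) {
        printf("UVA is not available on both devices.\n");
        return 0;
    }

    void *src = nullptr, *dst = nullptr;
    CHECK(cudaSetDevice(dstDev));
    CHECK(cudaMalloc(&dst, bytes));

    // Most recent cudaSetDevice() call: the stream below belongs to srcDev.
    CHECK(cudaSetDevice(srcDev));
    CHECK(cudaMalloc(&src, bytes));
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // With UVA the runtime infers the copy direction from the pointers.
    CHECK(cudaMemsetAsync(src, 0x5A, bytes, stream));
    CHECK(cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDefault, stream));
    CHECK(cudaStreamSynchronize(stream));

    // Read one byte back to the host to inspect the result.
    unsigned char first = 0;
    CHECK(cudaMemcpy(&first, dst, 1, cudaMemcpyDefault));
    printf("first byte on destination: 0x%02X\n", first);

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(src));
    CHECK(cudaSetDevice(dstDev));
    CHECK(cudaFree(dst));
    return 0;
}
```

If you would rather issue the copy from the destination side, call cudaSetDevice(dstDev) before cudaStreamCreate() and the same pattern should apply.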