when we use mpi_send/receive functions what happens? I mean this communication is done by value or by address of the variable that we desire to be sent and received (for example process 0 wants send variable “a” to process 1. Process 0 what exactly sends value of variable “a” or address of “a” ). And what happens when we use derived data types for communication?
when we use mpi_send/receive functions what happens? I mean this communication is done by
Share
Quite a bit of magic happens behind the scenes.
First, there’s the unexpected message queue. When the sender calls
MPI_Sendbefore the receiver has calledMPI_Recv, MPI doesn’t know where in the receiver’s memory the message is going. Two things can happen at this point. If the message is short, it is copied to a temporary buffer at the receiver. When the receiver callsMPI_Recvit first checks if a matching message has already arrived, and if it has, copies the data to the final destination. If not, the information about the target buffer is stored in the queue so theMPI_Recvcan be matched when the message arrives. It is possible to examine the unexpected queue withMPI_Probe.If the message is longer than some threshold, copying it would take too long. Instead, the sender and the receiver do a handshake with a rendezvous protocol of some sort to make sure the target is ready to receive the message before it is sent out. This is especially important with a high-speed network like InfiniBand.
If the communicating ranks are on the same machine, usually the data transfer happens through shared memory. Because MPI ranks are independent processes, they do not have access to each other’s memory. Instead, the MPI processes on the same node set up a shared memory region and use it to transfer messages. So sending a message involves copying the data twice: the sender copies it into the shared buffer, and the receiver copies it out into its own address space. There exists an exception to this. If the MPI library is configured to use a kernel module like KNEM, the message can be copied directly to the destination in the OS kernel. However, such a copy incurs a penalty of a system call. Copying through the kernel is usually only worth it for large messages. Specialized HPC operating systems like Catamount can change these rules.
Collective operations can be implemented either in terms of send/receive, or can have a completely separate optimized implementation. It is common to have implementations of several algorithms for a collective operation. The MPI library decides at runtime which one to use for best performance depending on the size of the messages and the communicator.
A good MPI implementation will try very hard to transfer a derived datatype without creating extra copies. It will figure out which regions of memory within a datatype are contiguous and copy them individually. However, in some cases MPI will fall back to using MPI_Pack behind the scenes to make the message contiguous, and then transfer and unpack it.