In MPI, each rank has a unique address space and communication between them happens via message passing.
I want to know how MPI works on a multicore machine that has shared memory.
If the ranks are on two different machines with no shared memory, then MPI has to use messages for communication. But if ranks are on the same physical machine (but still each rank has a different address space), will the MPI calls take advantage of the shared memory?
For example, suppose I’m issuing an ALLREDUCE call. I have two machines, M1 and M2, each with 2 cores. Ranks R1 and R2 are on core 1 and core 2 of machine M1, and R3 and R4 are on core 1 and core 2 of machine M2. How would the ALLREDUCE happen? Will more than one message be transmitted?
Ideally, I would expect R1 and R2 to do a reduce using the shared memory available to them (and similarly R3 and R4), followed by a message exchange between M1 and M2.
Is there any documentation where I can read about the implementation details of the collective operations in MPI?
Implementation of collective operations differs from one MPI library to another. The best place to look is the source code of the concrete library that you are using/want to use.
I can tell you how Open MPI implements collectives. Open MPI is composed of various layers in which different components (modules) live. There is the `coll` framework for collective operations, which uses the lower-level `btl` framework to transfer messages. Many different algorithms are implemented in the `coll` framework, and there are many different modules that implement those algorithms. A scoring mechanism selects what the library thinks is the best module for your case, but this can easily be overridden with MCA parameters.
The most prominent module is `tuned`, which is well tested and scales well on all kinds of interconnects, from shared memory to InfiniBand. The `tuned` module is quite oblivious to where processes are located. It just uses the `btl` framework to send messages, and `btl` takes care of using shared memory or network operations. Some of the algorithms in the `tuned` module are hierarchical, and with proper tuning of the parameters (OMPI’s great flexibility comes from the fact that many internal MCA parameters can be changed without recompiling) those algorithms can be made to match the actual hierarchy of the cluster. There is another `coll` module called `hierarch` that tries its best to gather as much physical topology information as possible and to use it to optimise collective communications.
Unfortunately, virtually all MPI implementations are written in C with very thin layers on top to provide Fortran interfaces, so I hope you have above-average knowledge of C if you’d like to dive into this topic. There are also many research papers on the optimisation of collective operations. Some of them are available for free; others are available through academic subscriptions.
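To make the difference between a topology-oblivious and a hierarchical scheme concrete, here is a toy model in plain C. It is not Open MPI code and makes no MPI calls. The names `allreduce_result`, `flat_allreduce` and `hierarchical_allreduce` are made up for this sketch. It assumes a naive "send everything to one root, then send the result back" pattern, and it assumes that traffic between ranks on the same machine goes through shared memory and costs no network messages. Real libraries use more elaborate algorithms, so treat the counts as illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Outcome of one simulated allreduce (sum): the value every rank ends up
 * with, and how many messages had to cross the network between machines. */
typedef struct {
    long sum;
    int network_messages;
} allreduce_result;

/* Flat scheme: every rank sends its value to rank 0, and rank 0 sends the
 * total back to every rank. Ranks are laid out machine by machine, so rank r
 * lives on machine r / ranks_per_machine. Only messages that leave
 * machine 0 (where rank 0 lives) are counted. */
allreduce_result flat_allreduce(const long *values, int machines,
                                int ranks_per_machine)
{
    assert(values != NULL && machines > 0 && ranks_per_machine > 0);
    allreduce_result r = {0, 0};
    int total_ranks = machines * ranks_per_machine;

    for (int rank = 0; rank < total_ranks; rank++) {
        r.sum += values[rank];
        if (rank / ranks_per_machine != 0) {
            /* One message to deliver the value, one to return the total. */
            r.network_messages += 2;
        }
    }
    return r;
}

/* Two-level scheme: (1) each machine reduces locally through shared memory,
 * (2) one leader per machine exchanges the partial sum with the leader of
 * machine 0, (3) each leader shares the total locally again. */
allreduce_result hierarchical_allreduce(const long *values, int machines,
                                        int ranks_per_machine)
{
    assert(values != NULL && machines > 0 && ranks_per_machine > 0);
    allreduce_result r = {0, 0};

    for (int m = 0; m < machines; m++) {
        long partial = 0;
        /* Step 1: local reduction, no network traffic in this model. */
        for (int i = 0; i < ranks_per_machine; i++) {
            partial += values[m * ranks_per_machine + i];
        }
        /* Step 2: the leader of machine m contributes its partial sum. */
        r.sum += partial;
        if (m != 0) {
            /* One message to deliver the partial sum, one to get the total. */
            r.network_messages += 2;
        }
        /* Step 3: local broadcast, again no network traffic in this model. */
    }
    return r;
}
```

For the layout in the question (two machines with two ranks each), both functions produce the same sum, but the flat scheme counts four network messages while the two-level scheme counts two. That is the kind of saving a topology-aware collective aims for.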