I asked the same question here, but I think it was too long so I’ll try again in a shorter way:
I’ve got a C++ program using the latest OpenMPI on a Rocks cluster under a master/slave setup. The slaves perform a task and then report data to the master using the blocking MPI_SEND / MPI_RECV calls (through Boost MPI), which writes the data to a database. The master is currently significantly slower than the slaves. I’m having trouble with the program because about half of the slaves get stuck on the first task and never report their data; using strace/ltrace, it seems that they’re stuck polling in MPI_SEND and their message never gets received.
I wrote a program to test this theory (again, listed in full here) and I can cause a similar problem – slave communications slow down significantly so they do less tasks than they should – by manipulating the speed of the slaves and masters using sleep. When the speed(master) > speed(slaves), everything works fine. When speed(master) < speed(slaves), messages get significantly delayed for some slaves every time.
Any ideas why this might be?
As far as I see this results from the recv in the while loop in the master node.
When there is a message from one slave the master cannot get any messages until the code inside while loop is finished (which take some while as there is a sleep), as the master node is not running parallel. Therefore all other slaves cannot start sending their messages until the first slave has finished sending his message. Then the next slave can start sending the message but then all other slaves are stopped until the code inside the while loop is executed.
This result in the behavior you see, that the slaves communication is very slow. to avoid this problem you need to implement the point to point communication non blocking or use global communications.
UPDATE 1:
Lets assume that the master distributed his data. Now he waits until the slaves report back. When the first slave reports back he will first send his REPORTTAG and then his DONETAG. Now the master will send him back a new job if the
Now the slaves start again with his calculation. It might be now the case that until he is finished the master was only able to handle another slave. So the slave of the beginning is now again sending first his REPORTTAG and then his DONETAG and gets an new job. When this continues in the end only 2 slaves have get new jobs and the rest were not able to finish their jobs. So that at some point this is true:
Now you stop all jobs even not all slaves have reported their data back and have done more than one task.
This problem occurs most when the network connection of the different nodes is highly different. The reason is that the send and receive are not handled after their call, instead the communication takes place if two of these functions are able to make some kind of handshake.
As solutions I would suggest either:
Hope this helped.