The Archive Base

Editorial Team
Asked: May 26, 2026 (2026-05-26T07:29:54+00:00)

I asked the same question here, but I think it was too long

I asked the same question here, but I think it was too long so I’ll try again in a shorter way:

I’ve got a C++ program using the latest OpenMPI on a Rocks cluster in a master/slave setup. The slaves perform a task and then report their data to the master using the blocking MPI_SEND / MPI_RECV calls (through Boost MPI); the master then writes the data to a database. The master is currently significantly slower than the slaves. About half of the slaves get stuck on the first task and never report their data; according to strace/ltrace, they are stuck polling in MPI_SEND and their message is never received.

I wrote a program to test this theory (again, listed in full here) and I can cause a similar problem, where slave communications slow down significantly so they do fewer tasks than they should, by manipulating the speed of the slaves and the master using sleep. When speed(master) > speed(slaves), everything works fine. When speed(master) < speed(slaves), messages get significantly delayed for some slaves every time.

Any ideas why this might be?

1 Answer

  1. Editorial Team
    Added an answer on May 26, 2026 at 7:29 am

    As far as I can see, this results from the blocking recv in the while loop on the master node.

     ...
     while (1) {
     // Receive results from slave.
          stat = world.recv(MPI_ANY_SOURCE,MPI_ANY_TAG);
     ...
    

    Once the master has received a message from one slave, it cannot receive any other message until the code inside the while loop has finished (which takes a while, because of the sleep), since the master node does not run in parallel. All other slaves therefore cannot complete their sends until the first slave's message has been fully handled. Then the next slave can get its message through, but again all remaining slaves are stalled until the loop body has run once more.

    This results in the behavior you see: communication with the slaves is very slow. To avoid the problem, you need to make the point-to-point communication non-blocking or use collective communication instead.
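    A rough back-of-the-envelope check of when the master becomes the bottleneck, under an assumed model that is not part of the question: N slaves, each needing T_s per task, and a master needing T_m to receive and store one report. The symbols are introduced here purely for illustration. A slave cannot cycle faster than T_s + T_m, because it is blocked while its report is being handled, and the master cannot handle more than one report per T_m.

```latex
\text{master capacity} = \frac{1}{T_m}, \qquad
\text{slave demand} \le \frac{N}{T_s + T_m}
\quad\Longrightarrow\quad
\text{master saturated when } \frac{N}{T_s + T_m} \ge \frac{1}{T_m}
\iff (N-1)\,T_m \ge T_s .
```

    For example, with N = 4, T_m = 2 and T_s = 1 the condition holds (6 ≥ 1), so slaves queue up behind the master; with T_m = 1 and T_s = 4 it does not (3 < 4), and the master has idle time between reports.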

    UPDATE 1:

    Let's assume that the master has distributed the initial tasks and is now waiting for the slaves to report back. When the first slave reports back, it first sends its REPORTTAG message and then its DONETAG message. The master then sends it a new job if

     currentTask < numtasks
    

    That slave now starts computing again. It may well be that, by the time it has finished, the master has only managed to handle one other slave. So the first slave again sends its REPORTTAG and then its DONETAG, and gets a new job. If this continues, in the end only two slaves have received new jobs, and the rest were never able to report their first one. At some point this becomes true:

     currentTask >= numtasks
    

    Now you stop all jobs, even though not every slave has reported its data back, while some slaves have done more than one task.
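    The scenario above can be reproduced with a small toy model. This is not MPI code and not the program from the question: `simulateMaster`, its parameters, and the rule that a wildcard receive always matches the lowest-ranked waiting slave are all assumptions made for illustration (MPI gives no fairness guarantee for MPI_ANY_SOURCE, so such an ordering is allowed, but real implementations may behave differently).

```cpp
#include <cassert>
#include <vector>

// Toy model (not MPI code): counts how many reports the master accepts from
// each slave before numTasks reports have been handled.
// Assumptions, all hypothetical: every slave needs workTime ticks per task,
// the master needs serviceTime ticks per report, and a wildcard receive
// always matches the lowest-ranked slave that is already waiting.
std::vector<int> simulateMaster(int numSlaves, int numTasks,
                                int workTime, int serviceTime) {
    assert(numSlaves > 0 && numTasks >= 0 && workTime >= 0 && serviceTime > 0);
    std::vector<int> readyAt(numSlaves, workTime);  // when each report is pending
    std::vector<int> accepted(numSlaves, 0);
    int now = 0;
    for (int handled = 0; handled < numTasks; ++handled) {
        // Prefer the lowest-ranked slave whose report is already pending.
        int pick = -1;
        for (int s = 0; s < numSlaves; ++s) {
            if (readyAt[s] <= now) { pick = s; break; }
        }
        if (pick < 0) {
            // Nobody is waiting: the master idles until the earliest report.
            pick = 0;
            for (int s = 1; s < numSlaves; ++s) {
                if (readyAt[s] < readyAt[pick]) pick = s;
            }
            now = readyAt[pick];
        }
        ++accepted[pick];
        now += serviceTime;              // master is busy (e.g. database write)
        readyAt[pick] = now + workTime;  // slave gets a new job, computes again
    }
    return accepted;
}
```

    With 4 slaves and 8 tasks, a slow master (`simulateMaster(4, 8, 1, 2)`) accepts reports only from the first two slaves, while a fast master (`simulateMaster(4, 8, 4, 1)`) serves all four evenly.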

    This problem is most pronounced when the network connections of the individual nodes differ widely. The reason is that a send and a receive are not necessarily completed as soon as they are called; the transfer only takes place once a send and a matching receive are able to perform some kind of handshake.

    As solutions, I would suggest one of the following:

    • Make sure that all slaves have finished before killing all jobs.
    • Use gather and scatter instead of individual messages; then all slaves are synchronized after each task.
    • Use buffered or non-blocking send and receive operations, if the messages are not too big. Make sure that the master does not run out of memory.
    • Change from master/slave to a more parallel scheme, e.g. divide all tasks between two nodes, then divide the tasks further from these nodes to the next two, and so on. At the end, send the results back the same way. This solution may also have the advantage that the communication cost is only O(log n) instead of O(n).
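    To put numbers on the O(log n) versus O(n) remark, here is a minimal sketch. It assumes an idealized model in which the set of nodes holding work doubles in every round; `treeRounds` and `flatSends` are hypothetical helper names, and message sizes and latencies are ignored.

```cpp
#include <cassert>

// Rounds needed to reach numNodes nodes when, in every round, each node that
// already holds work passes part of it on to one new node (idealized model).
int treeRounds(int numNodes) {
    assert(numNodes >= 1);
    int rounds = 0;
    for (long long reached = 1; reached < numNodes; reached *= 2) {
        ++rounds;
    }
    return rounds;
}

// Sequential sends a single master needs to reach every other node.
int flatSends(int numNodes) {
    assert(numNodes >= 1);
    return numNodes - 1;
}
```

    For 64 nodes, `treeRounds(64)` is 6 while `flatSends(64)` is 63; the gap grows with the node count.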

    Hope this helped.
