I’ve been working on timing (benchmarking) parallel algorithms, specifically matrix multiplication.

Asked: May 27, 2026

I’ve been working on timing (benchmarking) parallel algorithms, specifically matrix multiplication. I’m using the following algorithm:

if(taskid==MASTER) {
  averow = NRA/numworkers;
  extra = NRA%numworkers;
  offset = 0;
  mtype = FROM_MASTER;
  for (dest=1; dest<=numworkers; dest++)
  {
     rows = (dest <= extra) ? averow+1 : averow;    
     MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
     MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
     MPI_Send(&a[offset][0], rows*NCA, MPI_DOUBLE, dest, mtype,MPI_COMM_WORLD);
     MPI_Send(&b, NCA*NCB, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
     offset = offset + rows;
  }
  mtype = FROM_WORKER;
  for (i=1; i<=numworkers; i++)
  {
     source = i;
     MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
     MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
     MPI_Recv(&c[offset][0], rows*NCB, MPI_DOUBLE, source, mtype, 
              MPI_COMM_WORLD, &status);
     printf("Results received from process %d\n",source);
  }
}

else {
  mtype = FROM_MASTER;
  MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
  MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
  MPI_Recv(&a, rows*NCA, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
  MPI_Recv(&b, NCA*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);

  for (k=0; k<NCB; k++)
     for (i=0; i<rows; i++)
     {
        c[i][k] = 0.0;
        for (j=0; j<NCA; j++)
           c[i][k] = c[i][k] + a[i][j] * b[j][k];
     }
  mtype = FROM_WORKER;
  MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
  MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
  MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
}

I noticed that square matrices took less time than rectangular ones.
For example, with 4 nodes (one as master): if A is 500×500 and B is 500×500, each node performs 41.5 million iterations, whereas if A is 2400000×6 and B is 6×6, each node performs 28.8 million. Although the second case needs fewer iterations, it took about 1.00 s, while the first took only about 0.46 s.

Logically, the second should be faster, since it has fewer iterations per node.
Doing some math, I realized that MPI sends and receives 83,000 elements per message in the first case, and 4,800,000 elements in the second.

Does the size of the message justify the delay?

1 Answer

Editorial Team · Answer 1 · 2026-05-27T11:54:10+00:00

The size of the messages sent over MPI will definitely affect the performance
of your code. Take a look at these graphs posted on the webpage of one of the
popular MPI implementations.

As you can see in the first graph, communication latency increases
with message size. This trend applies to any network, not just the InfiniBand
network shown in that graph.
