I am having a hell of a time with this problem: using MPI, I want to combine several contiguous, non-overlapping columnar chunks of a 2-dimensional array, distributed across several MPI processes, into one array residing at the root process. The main condition is that the array must be the same for all the sending and receiving processes. The second condition is that the columnar chunks sent by each process can be of different widths. This seems to be a common problem in parallel programming, since I have seen at least six questions relating to this issue posted on Stack Overflow. None of the answers helped me, unfortunately. I can solve this problem quite nicely when I divide the array into row chunks, but not with columns. I realize it has to do with the differing strides in the case of the columnar subarrays. I have tried MPI vector and subarray types, both to no avail.
Using the simplified version of my code, if I execute it with COLUMNS equal to 6, I get:
0: 1 1 1 2 2 2
1: 1 1 1 2 2 2
2: 1 1 1 2 2 2
3: 1 1 1 2 2 2
4: 1 1 1 2 2 2
5: 1 1 1 2 2 2
6: 1 1 1 2 2 2
which is what I want.
On the other hand, if I execute it with COLUMNS = 5, I expect to get:
0: 1 1 1 2 2
1: 1 1 1 2 2
2: 1 1 1 2 2
3: 1 1 1 2 2
4: 1 1 1 2 2
5: 1 1 1 2 2
6: 1 1 1 2 2
Instead, I get:
0: 1 1 1 2 2
1: 2 1 1 2 2
2: 2 1 1 2 2
3: 2 1 1 2 2
4: 2 1 1 2 2
5: 1 1 1 -0 -0
6: 1 1 1 -0 -0
Listing of the simplified code:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#define ROWS 7
#define COLUMNS 6 // 5 or 6 only. I could pass this in the cmd line...
#define NR_OF_PROCESSES 2

void print_matrix (float ** X, int rows, int cols)
{
    for (int i = 0; i < rows; ++i) {
        printf ("%3d: ", i);
        for (int j = 0; j < cols; ++j)
            printf ("%2.0f ", X[i][j]);
        printf ("\n");
    }
}

float **allocate_matrix (int rows, int cols)
{
    // One contiguous data block plus an array of row pointers into it:
    float *data = (float *) malloc (rows * cols * sizeof(float));
    float **matrix = (float **) malloc (rows * sizeof(float *));
    for (int i = 0; i < rows; i++)
        matrix[i] = & (data[i * cols]);
    return matrix;
}

int main (int argc, char *argv[])
{
    int num_procs, my_rank, i, j, root = 0, ncols, ndims = 2, strts;
    float **matrix;
    MPI_Datatype sendsubarray, recvsubarray, resizedrecvsubarray;

    assert (COLUMNS == 5 || COLUMNS == 6);

    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &num_procs);
    if (num_procs != NR_OF_PROCESSES) MPI_Abort (MPI_COMM_WORLD, -1);
    MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);

    ncols = (my_rank == root) ? 3 : COLUMNS - 3;
    strts = (my_rank == root) ? 0 : 3;
    int sizes[2] = {ROWS, COLUMNS};
    int subsizes[2] = {ROWS, ncols};
    int starts[2] = {0, strts};

    // Create and populate the matrix at each node (incl. the root):
    matrix = allocate_matrix (ROWS, COLUMNS);
    for (i = 0; i < ROWS; i++)
        for (j = 0; j < COLUMNS; j++)
            matrix[i][j] = my_rank * -1.0;
    for (i = starts[0]; i < starts[0] + subsizes[0]; i++)
        for (j = starts[1]; j < starts[1] + subsizes[1]; j++)
            matrix[i][j] = my_rank + 1.0;

    // Create the subarray type for use by each send node (incl. the root):
    MPI_Type_create_subarray (ndims, sizes, subsizes, starts, MPI_ORDER_C,
                              MPI_FLOAT, &sendsubarray);
    MPI_Type_commit (&sendsubarray);

    // Create the subarray type for use by the receive node (the root):
    if (my_rank == root) {
        MPI_Type_create_subarray (ndims, sizes, subsizes, starts, MPI_ORDER_C,
                                  MPI_FLOAT, &recvsubarray);
        MPI_Type_commit (&recvsubarray);
        MPI_Type_create_resized (recvsubarray, 0, 1 * sizeof(float),
                                 &resizedrecvsubarray);
        MPI_Type_commit (&resizedrecvsubarray);
    }

    // Gather the send matrices into the receive matrix:
    int counts[NR_OF_PROCESSES] = {3, COLUMNS - 3};
    int displs[NR_OF_PROCESSES] = {0, 3};
    MPI_Gatherv (matrix[0], 1, sendsubarray,
                 matrix[0], counts, displs, resizedrecvsubarray,
                 root, MPI_COMM_WORLD);

    // Have the root send the main array to the output:
    if (my_rank == root) print_matrix (matrix, ROWS, COLUMNS);

    // Free out all the allocations we created in this node...
    if (my_rank == 0) {
        MPI_Type_free (&resizedrecvsubarray);
        MPI_Type_free (&recvsubarray);
    }
    MPI_Type_free (&sendsubarray);
    free (matrix[0]); // the contiguous data block
    free (matrix);    // the row-pointer array
    MPI_Finalize();
    return 0;
}
I am thinking that maybe there is no straightforward solution to my little problem, as exemplified by the code above, and therefore I will have to settle for some convoluted many-step solution where I have to handle the different width subarrays in separate ways before gathering them into the receiving array in two or three steps, instead of just one.
Any help will be much appreciated!
Nicely done! There’s a lot of juggling around of MPI details in there, and there was only one last thing missing: I only needed to add two lines and change a third to get your code to work.
The fact that you’ve mostly got things working is evidenced even in the mangled output; the right number of “2”s are being received, so you’re constructing the send type and sending your data correctly. The only trick is in the receive.
From the Gatherv call in your code (MPI_Gatherv with counts = {3, COLUMNS - 3} and displs = {0, 3}), you’ve correctly decided to receive in units of columns (so the first process has 3 columns to send, and the second has the rest), and your displacements make sense given your resize: you’ve resized the type to an extent of one array element, so each column starts immediately after the previous one.
The only snag was in your receive subarray type construction. When you call MPI_Type_create_subarray (ndims, sizes, subsizes, starts, MPI_ORDER_C, MPI_FLOAT, &recvsubarray) on the root, you’re creating a receive type which has, for the receiving process, the size, subsize, and offset of the data it’s sending! Instead, you just want to create a receive subarray type of exactly one column (subsizes of {ROWS, 1}) with a start of {0,0}, so there’s no intrinsic offset and you can just point it where it needs to go with your displacements.
When I run it with that, it works.
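To see why the original receive type scrambled the COLUMNS = 5 case but happened to work for COLUMNS = 6, here is a small plain-C model of the placement (no MPI involved). It is only a sketch and rests on one assumption: that the root fills the receive type's elements in row-major order with however many floats each rank actually sent. That assumption reproduces the output shown in the question, but the MPI standard does not promise it, since mismatched send and receive type signatures are erroneous. The names simulate_gather and MAX_COLS are hypothetical.

```c
#include <assert.h>

#define ROWS 7
#define MAX_COLS 6

/* Models the root's side of the two-rank gather from the question.
 * The receive type is a ROWS x type_width block starting at {0,0} of a
 * ROWS x cols array, with an extent of one float.
 * Rank 0 sends 3 columns of 1.0; rank 1 sends cols - 3 columns of 2.0.
 * Cells that are never written stay 0.0. */
void simulate_gather(int cols, int type_width, float out[ROWS * MAX_COLS])
{
    assert(cols >= 4 && cols <= MAX_COLS && type_width >= 1);

    int counts[2] = {3, cols - 3};  /* receive counts, in units of the type */
    int displs[2] = {0, 3};         /* displacements, in units of one float */

    for (int k = 0; k < ROWS * cols; k++)
        out[k] = 0.0f;

    for (int r = 0; r < 2; r++) {
        int incoming = ROWS * counts[r];  /* floats rank r actually sends */
        int placed = 0;
        for (int e = 0; e < counts[r] && placed < incoming; e++) {
            int base = displs[r] + e;     /* element e starts e floats later */
            for (int i = 0; i < ROWS && placed < incoming; i++)
                for (int j = 0; j < type_width && placed < incoming; j++) {
                    int idx = base + i * cols + j;
                    if (idx < ROWS * cols)   /* the model never writes out of bounds */
                        out[idx] = r + 1.0f;
                    placed++;
                }
        }
    }
}
```

Calling simulate_gather(5, 3, m) models the original type at the root (three columns wide) and puts 2s in column 0 of rows 1 to 4 while leaving columns 3 and 4 of rows 5 and 6 untouched, as in the mangled output. simulate_gather(5, 1, m) models the one-column type and fills columns 0 to 2 with 1s and columns 3 and 4 with 2s. With simulate_gather(6, 3, m), both ranks happen to send three columns, so the three-column type lands correctly by coincidence.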
(As a (much) more minor note, you don’t need to commit, or thus free, recvsubarray, as you never use it for actual communication; it’s only used in constructing the resizedrecvsubarray type, which is then committed.)