I am thinking about implementing a wrapper for MPI that imitates OpenMP’s way of

Question

0

Asked: June 10, 20262026-06-10T15:24:04+00:00 2026-06-10T15:24:04+00:00

I am thinking about implementing a wrapper for MPI that imitates OpenMP’s way of

0

I am thinking about implementing a wrapper for MPI that imitates OpenMP’s way
of parallelizing for loops.

  begin_parallel_region( chunk_size=100 , num_proc=10 );

  for( int i=0 ; i<1000 ; i++ )
  {
       //some computation 
  }

  end_parallel_region();

The code above distributes computation inside the for loop to 10 slave MPI processors.
Upon entering the parallel region, the chunk size and number of slave processors are provided.
Upon leaving the parallel region, the MPI processors are synched and are put idle.

EDITED in response to High Performance Mark.

I have no intention to simulate the OpenMP’s shared memory model.
I propose this because I need it.
I am developing a library that is required to build graphs from mathetical functions.
In these mathetical functions, there often exist for loops like the one below.

 for( int i=0 ; i<n ; i++ )
 {
          s = s + sin(x[i]);
 }

So I want to first be able to distribute sin(x[i]) to slave processors and at the end reduce to the single varible just like in OpenMP.

I was wondering if there is such a wrapper out there so that I don’t have to reinvent the wheel.

Thanks.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-10T15:24:06+00:00

Yes, you could do this, for specific tasks. But you shouldn’t.

Consider how you might implement this; the begin part would distribute the data, and the end part would bring the answer back:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>

typedef struct state_t {
    int globaln;
    int localn;
    int *locals;
    int *offsets;
    double *localin;
    double *localout;
    double (*map)(double);
} state;

state *begin_parallel_mapandsum(double *in, int n, double (*map)(double)) {
    state *s = malloc(sizeof(state));
    s->globaln = n;
    s->map = map;

    /* figure out decomposition */

    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    s->locals  = malloc(size * sizeof(int));
    s->offsets = malloc(size * sizeof(int));

    s->offsets[0] = 0;

    for (int i=0; i<size; i++) {
        s->locals[i] = (n+i)/size;
        if (i < size-1) s->offsets[i+1] = s->offsets[i] + s->locals[i];
    }

    /* allocate local arrays */
    s->localn   = s->locals[rank];
    s->localin  = malloc(s->localn*sizeof(double));
    s->localout = malloc(s->localn*sizeof(double));


    /* distribute */
    MPI_Scatterv( in, s->locals, s->offsets, MPI_DOUBLE,
                  s->localin, s->locals[rank], MPI_DOUBLE,
                  0, MPI_COMM_WORLD);

    return s;
}

double  end_parallel_mapandsum(state **s) {
    double localanswer=0., answer;

    /* sum up local answers */
    for (int i=0; i<((*s)->localn); i++) {
        localanswer += ((*s)->localout)[i];
    }

    /* and get global result.  Everyone gets answer */
    MPI_Allreduce(&localanswer, &answer, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    free( (*s)->localin );
    free( (*s)->localout );
    free( (*s)->locals );
    free( (*s)->offsets );
    free( (*s) );

    return answer;
}


int main(int argc, char **argv) {
    int rank;
    double *inputs;
    double result;
    int n=100;
    const double pi=4.*atan(1.);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        inputs = malloc(n * sizeof(double));
        for (int i=0; i<n; i++) {
            inputs[i] = 2.*pi/n*i;
        }
    }

    state *s=begin_parallel_mapandsum(inputs, n, sin);

    for (int i=0; i<s->localn; i++) {
        s->localout[i] = (s->map)(s->localin[i]);
    }

    result = end_parallel_mapandsum(&s);

    if (rank == 0) {
        printf("Calculated result: %lf\n", result);
        double trueresult = 0.;
        for (int i=0; i<n; i++) trueresult += sin(inputs[i]);
        printf("True  result: %lf\n", trueresult);
    }

    MPI_Finalize();

}

That constant distribute/gather is a terrible communications burden to sum up a few numbers, and is antithetical to the entire distributed-memory computing model.

To a first approximation, shared memory approaches – OpenMP, pthreads, IPP, what have you – are about scaling computations faster; about throwing more processors at the same chunk of memory. On the other hand, distributed-memory computing is about scaling a computation bigger; about using more resourses, particularly memory, than can be found on a single computer. The big win of using MPI is when you’re dealing with problem sets which can’t fit on any one node’s memory, ever. So when doing distributed-memory computing, you avoid having all the data in any one place.

It’s important to keep that basic approach in mind even when you are just using MPI on-node to use all the processors. The above scatter/gather approach will just kill performance. The more idiomatic distributed-memory computing approach is for the logic of the program to already have distributed the data – that is, your begin_parallel_region and end_parallel_region above would have already been built into the code above your loop at the very beginning. Then, every loop is just

 for( int i=0 ; i<localn ; i++ )
    {
          s = s + sin(x[i]);
    }

and when you need to exchange data between tasks (or reduce a result, or what have you) then you call the MPI functions to do those specific tasks.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I am thinking about implementing a wrapper for MPI that imitates OpenMP’s way of

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply