The Archive Base

Editorial Team
Asked: June 16, 2026

Good day everyone!

I’m running a molecular dynamics simulation and recently began trying to parallelise it. At first sight everything looked simple enough: put a #pragma omp parallel for directive in front of the most time-consuming loops. However, the functions in those loops operate on arrays, or, to be precise, on arrays that belong to an object of my class, which holds all the information about the particle system and the functions acting on it. When I added the #pragma directive before one of the most time-consuming loops, the computation time actually increased several times, even though my 2-core, 4-thread processor was fully loaded.

To sort this out I wrote another, simpler program. This test program performs two similar loops, the first in parallel and the second in serial, and measures the time each one takes. The results surprised me: whenever the first loop was computed in parallel, its computation time decreased compared with serial mode (1500 ms versus 6000 ms), but the computation time of the second loop increased drastically (15 000 ms versus 6000 ms in serial).

I tried the private() and firstprivate() clauses, but the results were the same. Shouldn’t every variable defined and initialized before the parallel region be shared automatically anyway? The computation time of the second loop gets back to normal if it is performed on another vector, vec2, but creating a new vector for every iteration is clearly not an option. I’ve also tried putting the actual update of vec1 into a #pragma omp critical section, but that didn’t help either. Neither did adding a shared(vec1) clause.

I would appreciate it if you could point out my errors and show the proper way.

Is it necessary to put that private(i) into the code?

Here is this test program:

#include "stdafx.h"
#include <omp.h>
#include <array>
#include <time.h>
#include <vector>
#include <iostream>
#include <Windows.h>
using namespace std;
#define N1  1000
#define N2  4000
#define dim 1000

int main(){
    vector<int>res1,res2;
    vector<double>vec1(dim),vec2(N1);
    clock_t t, tt;
    int k=0;
    for( k = 0; k<dim; k++){
        vec1[k]=1;
    }

    t = clock();

    #pragma omp parallel 
        {
        double temp = 0;  // initialise: each thread accumulates into its own copy
        int i,j,k;
        #pragma omp for private(i)
            for( i = 0; i<N1; i++){
                for(j = 0; j<N2; j++){  
                    for( k = 0; k<dim; k++){
                        temp+= j;
                    }
                }
                vec1[i]+=temp;
                temp = 0;
            }
        }
    tt = clock();
    cout<<tt-t<<endl;
    for(int k = 0; k<dim; k++){
        vec1[k]=1;
    }
    t = clock();
    for(int g = 0; g<N1; g++){
        for(int h = 0; h<N2; h++){
            for(int y = 0; y<dim; y++){
                vec1[g]+=h; 
            }
        }
    }
    tt = clock();
    cout<<tt-t<<endl;
    getchar();
}

Thank you for your time!

P.S. I use Visual Studio 2012; my processor is an Intel Core i3-2370M.
My assembly file in two parts:

http://pastebin.com/suXn35xj

http://pastebin.com/EJAVabhF

1 Answer

  1. Editorial Team
    Added an answer on June 16, 2026 at 1:06 am

    Congratulations! You have exposed yet another bad OpenMP implementation, courtesy of Microsoft. My initial theory was that the problem came from the partitioned L3 cache in Sandy Bridge and later Intel CPUs, but running the second loop on only the first half of the vector did not confirm that theory. So it has to be something in the code generator that is triggered when OpenMP is enabled, and the assembly output confirms this.

    Basically, the compiler does not optimise the serial loop when compiling with OpenMP enabled. That’s where the slowdown comes from. Part of the problem was also introduced by yourself, by making the second loop not identical to the first one. In the first loop you accumulate intermediate values into a temporary variable, which the compiler optimises into a register variable, while in the second loop you invoke operator[] on each iteration. When you compile without OpenMP enabled, the code optimiser transforms the second loop into something quite similar to the first loop, hence you get almost the same run time for both loops.

    When you enable OpenMP, the code optimiser does not optimise the second loop and it runs far slower. The fact that your code executes a parallel block before it has nothing to do with the slowdown. My guess is that the code optimiser is unable to grasp that the second loop's use of vec1 is outside the scope of the OpenMP parallel region, and hence that vec1 should no longer be treated as a shared variable there and the loop can be optimised. Apparently this is a “feature” introduced in Visual Studio 2012, since the code generator in Visual Studio 2010 is able to optimise the second loop even with OpenMP enabled.
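    As a side note on the data-sharing rules involved (and on the private(i) question), here is a minimal sketch of the parallel loop. It relies on the standard OpenMP defaults: a variable declared before the parallel region is shared, the iteration variable of the work-shared loop is private without any clause, and anything declared inside the region is private to each thread. The function name parallel_accumulate and its parameters are made up for illustration; they are not part of the original program.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not from the question: adds the inner-loop sum to v[i].
void parallel_accumulate(std::vector<double>& v, int n1, int n2, int dim) {
    assert(n1 >= 0 && static_cast<std::size_t>(n1) <= v.size());
    // v is declared outside the parallel region, so it is shared by default.
    // The loop variable i of the work-shared loop is private automatically,
    // so no private(i) clause is needed.
    #pragma omp parallel for
    for (int i = 0; i < n1; ++i) {
        double temp = 0.0;  // declared inside the region: one copy per thread
        for (int j = 0; j < n2; ++j)
            for (int k = 0; k < dim; ++k)
                temp += j;
        v[i] += temp;  // each iteration writes a distinct element, so no race
    }
}
```

    Compiled without OpenMP support the pragma is ignored and the function simply runs serially, which makes it easy to compare the two builds.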

    One possible solution would be to migrate to Visual Studio 2010. Another (hypothetical, since I don’t have VS2012) solution would be to extract the second loop into a function and to pass the vector by reference to it. Hopefully the compiler would be smart enough to optimise the code in the separate function.
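    The second workaround might look roughly like the sketch below. It is untested on VS2012, so whether the optimiser actually treats the separate function better is an assumption. The function names and parameters are hypothetical. The first variant keeps the operator[] form of the question's second loop; the second variant also accumulates into a local variable, as the first loop does, so that it does not depend on the optimiser at all.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Variant 1 (hypothetical name): the question's second loop moved into its
// own function. Every innermost iteration still goes through operator[].
void serial_accumulate_in_place(std::vector<double>& v, int n1, int n2, int dim) {
    assert(n1 >= 0 && static_cast<std::size_t>(n1) <= v.size());
    for (int g = 0; g < n1; ++g)
        for (int h = 0; h < n2; ++h)
            for (int y = 0; y < dim; ++y)
                v[g] += h;
}

// Variant 2 (hypothetical name): same result, but the running sum lives in a
// local variable and the vector element is written once per outer iteration.
void serial_accumulate_via_local(std::vector<double>& v, int n1, int n2, int dim) {
    assert(n1 >= 0 && static_cast<std::size_t>(n1) <= v.size());
    for (int g = 0; g < n1; ++g) {
        double temp = 0.0;
        for (int h = 0; h < n2; ++h)
            for (int y = 0; y < dim; ++y)
                temp += h;
        v[g] += temp;
    }
}
```

    In the test program, the second timed loop would then become a single call such as serial_accumulate_via_local(vec1, N1, N2, dim).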

    This is a very bad trend. Microsoft have practically given up on supporting OpenMP in Visual C++. Their implementation still (almost) conforms to OpenMP 2.0 only (hence no explicit tasks and other OpenMP 3.0+ goodies) and bugs like this one do not make things any better. I would recommend that you switch to another OpenMP enabled compiler (Intel C/C++ Compiler, GCC, anything non-Microsoft) or switch to some other compiler independent threading paradigm, for example Intel Threading Building Blocks. Microsoft is clearly pushing their parallel library for .NET and that’s where all the development goes.


    Big Fat Warning

    Do not use clock() to measure elapsed wall-clock time! This only works as expected on Windows. On most Unix systems (including Linux) clock() actually returns the total CPU time consumed by all threads in the process since it was created. This means that clock() may return values that are either several times larger than the elapsed wall-clock time (if the program runs with many busy threads) or several times smaller than the wall-clock time (if the program sleeps or waits on I/O events between the measurements). In OpenMP programs, use the portable timer function omp_get_wtime() instead.
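    A minimal timing sketch along those lines follows. The helpers wall_seconds and time_seconds are hypothetical names. The fallback to std::chrono::steady_clock when the code is built without OpenMP is an addition of this sketch, not something the advice above requires.

```cpp
#include <chrono>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: wall-clock seconds measured from some fixed point in
// the past. Only differences between two calls are meaningful.
double wall_seconds() {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    typedef std::chrono::steady_clock steady;
    return std::chrono::duration<double>(steady::now().time_since_epoch()).count();
#endif
}

// Hypothetical helper: runs the callable once and returns the elapsed
// wall-clock time in seconds.
template <typename F>
double time_seconds(F&& work) {
    const double start = wall_seconds();
    work();
    return wall_seconds() - start;
}
```

    In the test program, each pair of clock() calls could be replaced by wrapping the loop in a lambda and passing it to time_seconds, then printing the returned value.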

© 2021 The Archive Base. All Rights Reserved