Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 763367
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 14, 20262026-05-14T16:32:58+00:00 2026-05-14T16:32:58+00:00

Below is my innermost loop that’s run several thousand times, with input sizes of

  • 0

Below is my innermost loop that’s run several thousand times, with input sizes of 20 – 1000 or more. This piece of code takes up 99 – 99.5% of execution time. Is there anything I can do to help squeeze any more performance out of this?

I’m not looking to move this code to something like using tree codes (Barnes-Hut), but towards optimizing the actual calculations happening inside, since the same calculations occur in the Barnes-Hut algorithm.

Any help is appreciated!

Edit: I’m running in Windows 7 64-bit with Visual Studio 2008 edition on a Core 2 Duo T5850 (2.16 GHz)

typedef double real;

struct Particle
{
    Vector pos, vel, acc, jerk;
    Vector oldPos, oldVel, oldAcc, oldJerk;
    real mass;
};

class Vector
{
private:
    real vec[3];

public:
    // Operators defined here
};

real Gravity::interact(Particle *p, size_t numParticles)
{
    PROFILE_FUNC();
    real tau_q = 1e300;

    for (size_t i = 0; i < numParticles; i++)
    {
        p[i].jerk = 0;
        p[i].acc = 0;
    }

    for (size_t i = 0; i < numParticles; i++)
    {
        for (size_t j = i+1; j < numParticles; j++)
        {
            Vector r = p[j].pos - p[i].pos;
            Vector v = p[j].vel - p[i].vel;
            real r2 = lengthsq(r);
            real v2 = lengthsq(v);

            // Calculate inverse of |r|^3
            real r3i = Constants::G * pow(r2, -1.5);

            // da = r / |r|^3
            // dj = (v / |r|^3 - 3 * (r . v) * r / |r|^5
            Vector da = r * r3i;
            Vector dj = (v - r * (3 * dot(r, v) / r2)) * r3i;

            // Calculate new acceleration and jerk
            p[i].acc += da * p[j].mass;
            p[i].jerk += dj * p[j].mass;
            p[j].acc -= da * p[i].mass;
            p[j].jerk -= dj * p[i].mass;

            // Collision estimation
            // Metric 1) tau = |r|^2 / |a(j) - a(i)|
            // Metric 2) tau = |r|^4 / |v|^4
            real mij = p[i].mass + p[j].mass;
            real tau_est_q1 = r2 / (lengthsq(da) * mij * mij);
            real tau_est_q2 = (r2*r2) / (v2*v2);

            if (tau_est_q1 < tau_q)
                tau_q = tau_est_q1;
            if (tau_est_q2 < tau_q)
                tau_q = tau_est_q2;
        }
    }

    return sqrt(sqrt(tau_q));
}
  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-14T16:32:58+00:00Added an answer on May 14, 2026 at 4:32 pm
    1. Inline the calls to lengthsq().

    2. Change pow(r2,-1.5) to 1/(r2*sqrt(r2)) to lower the cost of the computing r^1.5

    3. Use scalars (p_i_acc, etc.) inside the innner most loop rather than p[i].acc to collect your result. The compiler may not know that p[i] isn’t aliased with p[j], and that might force addressing of p[i] on each loop iteration unnecessarily.

    4a. Try replacing the if (…) tau_q = with

        tau_q=minimum(...,...)
    

    Many compilers recognize the mininum function as one they can do with predicated operations rather than real branches, avoiding pipeline flushes.

    4b. [EDIT to split 4a and 4b apart] You might consider storing tau_..q2 instead as tau_q, and comparing against r2/v2 rather than r2*r2/v2*v2. Then you avoid doing two multiplies for each iteration in the inner loop, in trade for a single squaring operation to compute tau..q2 at the end. To do this, collect minimums of tau_q1 and tau_q2 (not squared) separately, and take the minimum of those results in a single scalar operation on completion of the loop]

    1. [EDIT: I suggested the following, but in fact it isn’t valid for the OP’s code, because of the way he updates in the loop.] Fold the two loops together. With the two loops and large enough set of particles, you thrash the cache and force a refetch from non-cache of those initial values in the second loop. The fold is trivial to do.

    Beyond this you need to consider a) loop unrolling, b) vectorizing (using SIMD instructions; either hand coding assembler or using the Intel compiler, which is supposed to be pretty good at this [but I have no experience with it], and c) going multicore (using OpenMP).

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I understand how to process this doc (below) with XSLT from the outermost person
I get the error below on several views. Which seems identical to what is
Below is the link for making the cover flow model in iphone. This is
Below I have a very simple example of what I'm trying to do. I
Below are two ways of reading in the commandline parameters. The first is the
Below is my current char* to hex string function. I wrote it as an
Below is part of the XML which I am processing with PHP's XSLTProcessor :
Below is my $.ajax call, how do I put a selects (multiple) selected values
Below is the code of a simple html with a table layout. In FF
Below are lines from the c++ programming language template<class T > T sqrt(T );

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.