Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8136633
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 6, 20262026-06-06T10:49:22+00:00 2026-06-06T10:49:22+00:00

I have the following function (from the opensource project recast navigation ): /// Derives

  • 0

I have the following function (from the opensource project “recast navigation”):

/// Derives the dot product of two vectors on the xz-plane. (@p u . @p v)
///  @param[in]     u       A vector [(x, y, z)]
///  @param[in]     v       A vector [(x, y, z)]
/// @return The dot product on the xz-plane.
///
/// The vectors are projected onto the xz-plane, so the y-values are ignored.
inline float dtVdot2D(const float* u, const float* v)
{
    return u[0]*v[0] + u[2]*v[2];
}

I ran it through the VS2010 CPU performance test and it showed me that in all recast codebase codeline in this function u[0]*v[0] + u[2]*v[2] is most CPU hot .

How can I CPU optimize (via SSE or GLSL like GLM (if it is easier or faster and appropriate in such case) for example) this line?

Edit: The context in which the calls appear:

bool dtClosestHeightPointTriangle(const float* p, const float* a, const float* b, const float* c, float& h) {
    float v0[3], v1[3], v2[3];
    dtVsub(v0, c,a);
    dtVsub(v1, b,a);
    dtVsub(v2, p,a);

    const float dot00 = dtVdot2D(v0, v0);
    const float dot01 = dtVdot2D(v0, v1);
    const float dot02 = dtVdot2D(v0, v2);
    const float dot11 = dtVdot2D(v1, v1);
    const float dot12 = dtVdot2D(v1, v2);

    // Compute barycentric coordinates
    const float invDenom = 1.0f / (dot00 * dot11 - dot01 * dot01);
    const float u = (dot11 * dot02 - dot01 * dot12) * invDenom;
    const float v = (dot00 * dot12 - dot01 * dot02) * invDenom;
  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-06T10:49:24+00:00Added an answer on June 6, 2026 at 10:49 am

    After trying a few things on paper I’ve come up with something that might work for you. It is a properly parallelized/vectorized implementation of the function in SSE.

    It does however require data reorganization because we’ll do parallel processing for 4 triangles at once.

    I will break it up in steps and use instruction names here and there, but please use the C intrinsics (_mm_load_ps(), _mm_sub_ps() et al, they’re in xmmintrin.h in VC) — when I speak of registers this just means __m128.

    STEP 1.

    We do not need the Y coordinate at all, so we set up pointers to pairs of X and Z. Supply at least 4 pairs (i.e. 4 triangles total) per call. I’ll call each X and Z pair a vertex.

    STEP 2.

    Use MOVAPS (requires the pointers to be aligned to 16-bit) to load the first two vertices pointed to by each pointer into registers.

    A register loaded from a will look like this: [ a0.x, a0.z, a1.x, a1.z ]

    STEP 3.

    Now, using a single subtract instruction, you can calculate the deltas (your v0, v1, v2) for 2 vertices at once.

    Calculate v0, v1 and v2 not only for the first 2 triangles, but also for the latter 2!
    As I said you should supply a total of 4 vertices, or 8 floats, per input. Just repeat steps 2 and 3 for that data.

    Now we have 2 pairs of vx registers, each pair containing the result for 2 triangles. I’ll refer to them as vx_0 (first pair) and vx_1 (second pair) where x is from 0 to 2.

    STEP 4.

    Dot products. In order to parallelize the barycentric calculation (later on) we require the result of each dot product for each of the 4 triangles, in 1 single register.

    So where you’d calculate dot01 for example, we will do the same, but for 4 triangles at once. Each v-register contains the result for 2 vectors, so we begin by multiplying them.

    Let us say that u and v — parameters in your dot product function — are now v0_0 and v1_0 (as to calculate dot01):

    Multiply u and v to get: [ (v0_0.x0) * (v1_0.x0), (v0_0.z0) * (v1_0.z0), (v0_0.x1) * (v1_0.x1), (v0_0.z1) * (v1_0.z1) ]

    This may look confusing because of .x0 / .z0 and .x1 / .z1, but look at what was loaded at step 2: a0, a1.

    If by now this still feels fuzzy, pick up a piece of paper and write along from the start.

    Next, still for the same dot product, do the multiplication for v0_1 and v1_1 (i.e. the second pair of triangles).

    Now we have 2 registers with 2 X & Z pairs each (4 vertices total), multiplied and ready to be added together to form 4 separate dot products. SSE3 has an instruction to do exactly this, and it’s called HADDPS:

    xmm0 = [A, B, C, D]
    xmm1 = [E, F, G, H]

    HADDPS xmm0, xmm1 does this:

    xmm0 = [A+B, C+D, E+F, G+H]

    It will take the X & Z pairs from our first register, those from the second, add them together and store them in the first, second, third and fourth component of the destination register. Ergo: at this point we’ve got this particular dot product for all 4 triangles!

    **Now repeat this process for all the dot products: dot00 et cetera. **

    STEP 5.

    The last calculation (as far as I could see by the supplied code) is the barycentric stuff. This is a 100% scalar calculation in your code. But your inputs now are not scalar dot product results (i.e. single floats), they are SSE vectors/registers with a dot product for each of the 4 triangles.

    So if you flat out vectorize this by using the parallel SSE operations that operate on all 4 floats, you’ll eventually end up with 1 register (or result) carrying 4 heights, 1 for each triangle.

    Since my lunch break is well past due I’m not going to spell this one out, but given the setup/idea I’ve given this is a last step and shouldn’t be hard to figure out.

    I know that this idea is a bit of a stretch and requires some loving from the code that sits above it and perhaps some quality time with pencil and paper, but it will be fast (and you can even add OpenMP afterwards if you’d like).

    Good luck 🙂

    (and forgive my fuzzy explanation, I can whip up the function if need be =))

    UPDATE

    I’ve written an implementation and it did not go as I expected, mainly because the Y component got involved beyond the piece of code that you initially pasted in your question ( I looked it up). What I’ve done here is not just ask you to rearrange all points in to XZ pairs and feed them per 4, but also feed 3 pointers (for points A, B and C) with the Y values for each of the 4 triangles. From a local perspective this is fastest. I can still modify it to require less intrusive changes from the callee’s end, but please let me know what’s desirable.

    Then a disclaimer: this code is straightforward as hell (something I’ve found to work pretty well with compilers in terms of SSE… they can reorganize as see fit and x86/x64 CPUs take their share in it too). Also the naming, it’s not my style, it’s not anyone’s, just do with it what you see fit.

    Hope it helps and if not I’ll gladly go over it again. And if this is a commercial project there’s also the option of getting me on board I guess 😉

    Anyway, I’ve put it on pastebin: http://pastebin.com/20u8fMEb

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have the following function call from a thread: Thread Move = new Thread(){
I have the following function that retrieves an image from a twitter feed, the
I have the following function in ruby for retrieving certain information from a database.
I have following repository: public function findClassPhotoByPath($path) { return $this->getEntityManager() ->createQuery('SELECT p FROM KSRGalleryBundle:Photo
I have the following piece of code from a book. There is this function
I have the following function from some legacy code that I am maintaining. long
I have the following function to get an int from a high-byte and a
I have the following function that draws some data from a chi-squared distribution and
I have the following function which takes a tiff stored as blob from an
I want to call AS function from JS. I have the following ActionScript 3

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.