The Archive Base

Editorial Team
Asked: June 3, 2026

The Intel Advanced Vector Extensions (AVX) offers no dot product in the 256-bit version


Intel Advanced Vector Extensions (AVX) offers no dot product instruction in the 256-bit version (YMM registers) for double-precision floating-point values. The “Why?” question has been treated very briefly in another forum (here) and on Stack Overflow (here). The question I am facing is how to replace this missing instruction with other AVX instructions in an efficient way.

The dot product in 256-bit version exists for single precision floating point variables (reference here):

 __m256 _mm256_dp_ps(__m256 m1, __m256 m2, const int mask);

The idea is to find an efficient equivalent for this missing instruction:

 __m256d _mm256_dp_pd(__m256d m1, __m256d m2, const int mask);

To be more specific, the code I would like to convert from __m128 (four floats) to __m256d (four doubles) uses the following instructions:

   __m128 val0 = ...; // Four float values
   __m128 val1 = ...; //
   __m128 val2 = ...; //
   __m128 val3 = ...; //
   __m128 val4 = ...; //

   __m128 res = _mm_or_ps( _mm_dp_ps(val1,  val0,   0xF1),
                _mm_or_ps( _mm_dp_ps(val2,  val0,   0xF2),
                _mm_or_ps( _mm_dp_ps(val3,  val0,   0xF4),
                           _mm_dp_ps(val4,  val0,   0xF8) )));

The result of this code is an __m128 vector of four floats containing the dot products of val1 and val0, val2 and val0, val3 and val0, and val4 and val0.
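For comparison, the intended result can be written as a plain scalar reference in double precision. This is only a sketch: the function name `dot4x4_ref` and the array layout (four rows standing in for val1 through val4, plus a common vector standing in for val0) are assumptions made for illustration, not part of the original code.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference (hypothetical helper): out[i] = dot(v[i], v0) for i = 0..3.
   v0 plays the role of val0; the rows of v play the roles of val1..val4. */
static void dot4x4_ref(const double v0[4], const double v[4][4], double out[4])
{
    assert(v0 != NULL && v != NULL && out != NULL);
    for (size_t i = 0; i < 4; ++i) {
        double sum = 0.0;
        for (size_t k = 0; k < 4; ++k)
            sum += v[i][k] * v0[k];   /* accumulate one dot product */
        out[i] = sum;
    }
}
```

A reference like this is mainly useful for checking a vectorized version against known inputs.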

Maybe this can give hints for the suggestions?

1 Answer

  1. Editorial Team
    Added an answer on June 3, 2026 at 7:02 am

    I would use a 4*double multiplication, then a hadd (which unfortunately only adds the two adjacent doubles within each 128-bit half), then extract the upper half (a shuffle should work equally well, and may be faster) and add it to the lower half.

    The result is in the low 64 bits of dotproduct.

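    __m256d xy = _mm256_mul_pd( x, y );
    __m256d temp = _mm256_hadd_pd( xy, xy );
    __m128d hi128 = _mm256_extractf128_pd( temp, 1 );
    // _mm256_castpd256_pd128 reinterprets the low half; a plain C cast between vector sizes is not portable
    __m128d dotproduct = _mm_add_pd( _mm256_castpd256_pd128( temp ), hi128 );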
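The steps above can be wrapped into a small function for experimenting. This is a minimal sketch under a few assumptions: the name `dot4_avx` and the pointer-based interface are made up for illustration, the loads are unaligned, and `__attribute__((target("avx")))` is a GCC/Clang feature (with other compilers, drop it and enable AVX through the compiler's own option). The CPU running it must support AVX.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* Dot product of two 4-element double vectors (hypothetical helper). */
__attribute__((target("avx")))
static double dot4_avx(const double *x, const double *y)
{
    assert(x != NULL && y != NULL);
    __m256d vx = _mm256_loadu_pd(x);                 /* unaligned loads */
    __m256d vy = _mm256_loadu_pd(y);
    __m256d xy = _mm256_mul_pd(vx, vy);              /* x0*y0 x1*y1 x2*y2 x3*y3 */
    __m256d temp = _mm256_hadd_pd(xy, xy);           /* pair sum of each 128-bit half */
    __m128d lo128 = _mm256_castpd256_pd128(temp);    /* low half, no instruction needed */
    __m128d hi128 = _mm256_extractf128_pd(temp, 1);  /* high half */
    __m128d sum = _mm_add_pd(lo128, hi128);          /* total in both lanes */
    return _mm_cvtsd_f64(sum);                       /* read the low 64 bits */
}
```

Calling `dot4_avx(x, y)` on two arrays of four doubles returns the scalar dot product; the additions happen pairwise, so rounding may differ slightly from a left-to-right scalar loop.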
    

    Edit:
    Following an idea from Norbert P., I extended this version to compute four dot products at once.

    __m256d xy0 = _mm256_mul_pd( x[0], y[0] );
    __m256d xy1 = _mm256_mul_pd( x[1], y[1] );
    __m256d xy2 = _mm256_mul_pd( x[2], y[2] );
    __m256d xy3 = _mm256_mul_pd( x[3], y[3] );
    
    // low to high: xy00+xy01 xy10+xy11 xy02+xy03 xy12+xy13
    __m256d temp01 = _mm256_hadd_pd( xy0, xy1 );   
    
    // low to high: xy20+xy21 xy30+xy31 xy22+xy23 xy32+xy33
    __m256d temp23 = _mm256_hadd_pd( xy2, xy3 );
    
    // low to high: xy02+xy03 xy12+xy13 xy20+xy21 xy30+xy31
    __m256d swapped = _mm256_permute2f128_pd( temp01, temp23, 0x21 );
    
    // low to high: xy00+xy01 xy10+xy11 xy22+xy23 xy32+xy33
    __m256d blended = _mm256_blend_pd(temp01, temp23, 0xC);  // 0xC == 0b1100; hex avoids relying on binary literals
    
    __m256d dotproduct = _mm256_add_pd( swapped, blended );
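Applied to the case in the question (one common vector multiplied against four others), the four-at-once version might look like the sketch below. The function name `dot4x4_avx`, the pointer arguments, and the unaligned loads and store are assumptions for illustration; `__attribute__((target("avx")))` is specific to GCC and Clang, and the CPU must support AVX.

```c
#include <assert.h>
#include <immintrin.h>

/* out[i] = dot(v(i+1), v0) for i = 0..3 (hypothetical helper).
   v0 corresponds to val0 in the question, v1..v4 to val1..val4. */
__attribute__((target("avx")))
static void dot4x4_avx(const double *v0, const double *v1, const double *v2,
                       const double *v3, const double *v4, double *out)
{
    assert(v0 && v1 && v2 && v3 && v4 && out);
    __m256d c = _mm256_loadu_pd(v0);
    __m256d xy0 = _mm256_mul_pd(_mm256_loadu_pd(v1), c);
    __m256d xy1 = _mm256_mul_pd(_mm256_loadu_pd(v2), c);
    __m256d xy2 = _mm256_mul_pd(_mm256_loadu_pd(v3), c);
    __m256d xy3 = _mm256_mul_pd(_mm256_loadu_pd(v4), c);

    /* low to high: xy00+xy01 xy10+xy11 xy02+xy03 xy12+xy13 */
    __m256d temp01 = _mm256_hadd_pd(xy0, xy1);
    /* low to high: xy20+xy21 xy30+xy31 xy22+xy23 xy32+xy33 */
    __m256d temp23 = _mm256_hadd_pd(xy2, xy3);

    /* high half of temp01 followed by low half of temp23 */
    __m256d swapped = _mm256_permute2f128_pd(temp01, temp23, 0x21);
    /* low half of temp01 followed by high half of temp23 */
    __m256d blended = _mm256_blend_pd(temp01, temp23, 0xC);

    /* each lane now adds the two partial sums of one dot product */
    _mm256_storeu_pd(out, _mm256_add_pd(swapped, blended));
}
```

Unlike the `_mm_dp_ps` version, no OR-combining of masked results is needed here: the four dot products already land in lanes 0 through 3 of the final sum.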
    