The problem can be described as follows.
Input
__m256d a, b, c, d
Output
__m256d s = {a[0]+a[1]+a[2]+a[3], b[0]+b[1]+b[2]+b[3],
c[0]+c[1]+c[2]+c[3], d[0]+d[1]+d[2]+d[3]}
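To pin down the expected result, here is a minimal scalar sketch of the same computation. The function name hsum4_ref and the array-based signature are just assumptions for illustration; the additions are grouped in pairs so that the rounding matches what a two-stage horizontal add would produce.

```c
#include <stddef.h>

/* Scalar reference (hypothetical helper name): out[i] receives the sum of
   the four elements of the i-th input vector, added pairwise. */
void hsum4_ref(const double a[4], const double b[4],
               const double c[4], const double d[4], double out[4])
{
    const double *in[4] = { a, b, c, d };
    for (size_t i = 0; i < 4; ++i)
        out[i] = (in[i][0] + in[i][1]) + (in[i][2] + in[i][3]);
}
```

Any vectorized version can be checked against this on a few inputs.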
Work I have done so far
It seemed easy enough: two VHADDs with some shuffling in between. In fact, though, no combination of the permutations offered by AVX seems able to generate the very permutation needed to achieve that goal. Let me explain:
VHADD x, a, b => x = {a[0]+a[1], b[0]+b[1], a[2]+a[3], b[2]+b[3]}
VHADD y, c, d => y = {c[0]+c[1], d[0]+d[1], c[2]+c[3], d[2]+d[3]}
Were I able to permute x and y in the same manner to get
x1 = {a[0]+a[1], a[2]+a[3], c[0]+c[1], c[2]+c[3]}
y1 = {b[0]+b[1], b[2]+b[3], d[0]+d[1], d[2]+d[3]}
then
VHADD s, x1, y1 => s = {a[0]+a[1]+a[2]+a[3], b[0]+b[1]+b[2]+b[3],
c[0]+c[1]+c[2]+c[3], d[0]+d[1]+d[2]+d[3]}
which is the result I wanted.
Thus I just need to find how to perform
x,y => {x[0], x[2], y[0], y[2]}, {x[1], x[3], y[1], y[3]}
Unfortunately I came to the conclusion that this is provably impossible using any combination of VSHUFPD, VBLENDPD, VPERMILPD, VPERM2F128, VUNPCKHPD, VUNPCKLPD. The crux of the matter is that it is impossible to swap u[1] and u[2] in an instance u of __m256d.
Question
Is this really a dead end? Or have I missed a permutation instruction?
VHADD instructions are meant to be followed by a regular VADD. The following code should give you what you want:
This gives the result in 5 instructions. I hope I got the constants right.
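One way to get there in five arithmetic/shuffle instructions, written as a sketch with intrinsics: two VHADDPD, one VBLENDPD, one VPERM2F128 and a final VADDPD. The function name hsum4_avx, the pointer-based signature and the unaligned loads/stores are assumptions made so the snippet stands on its own, and the target attribute is GCC/Clang-specific (it lets the function compile without -mavx for the whole file).

```c
#include <immintrin.h>

/* Hypothetical wrapper: loads four 4-double vectors, writes their four sums. */
__attribute__((target("avx")))
void hsum4_avx(const double *pa, const double *pb,
               const double *pc, const double *pd, double *out)
{
    __m256d a = _mm256_loadu_pd(pa);
    __m256d b = _mm256_loadu_pd(pb);
    __m256d c = _mm256_loadu_pd(pc);
    __m256d d = _mm256_loadu_pd(pd);

    /* {a0+a1, b0+b1, a2+a3, b2+b3} */
    __m256d sumab = _mm256_hadd_pd(a, b);
    /* {c0+c1, d0+d1, c2+c3, d2+d3} */
    __m256d sumcd = _mm256_hadd_pd(c, d);

    /* Low half from sumab, high half from sumcd:
       {a0+a1, b0+b1, c2+c3, d2+d3} */
    __m256d blend = _mm256_blend_pd(sumab, sumcd, 0xC);
    /* High half of sumab, then low half of sumcd:
       {a2+a3, b2+b3, c0+c1, d0+d1} */
    __m256d perm = _mm256_permute2f128_pd(sumab, sumcd, 0x21);

    /* Vertical add completes each of the four sums. */
    _mm256_storeu_pd(out, _mm256_add_pd(perm, blend));
}
```

The point of the design is that no element ever has to move within a 128-bit half across lanes: the blend and the 128-bit permute line the partial sums up so that a plain vertical add finishes the job.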
The permutation that you proposed is certainly possible to accomplish, but it takes multiple instructions. Sorry that I’m not answering that part of your question.
Edit: I couldn’t resist, here’s the complete permutation. (Again, did my best to try to get the constants right.)
You can see that swapping u[1] and u[2] is possible, just takes a bit of work. Crossing the 128-bit barrier is difficult in the first gen. AVX. I also want to say that VADD is preferable to VHADD because VADD has twice the throughput, even though it’s doing the same number of additions.
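As a sketch of how the x,y => {x[0], x[2], y[0], y[2]}, {x[1], x[3], y[1], y[3]} permutation from the question can be built in four instructions: two VPERM2F128 regroup the 128-bit halves, then VUNPCKLPD and VUNPCKHPD pick the even and odd elements. The function name deinterleave_avx and its pointer-based signature are assumptions for illustration, and the target attribute is GCC/Clang-specific.

```c
#include <immintrin.h>

/* Hypothetical helper: even = {x0, x2, y0, y2}, odd = {x1, x3, y1, y3}. */
__attribute__((target("avx")))
void deinterleave_avx(const double *px, const double *py,
                      double *even, double *odd)
{
    __m256d x = _mm256_loadu_pd(px);
    __m256d y = _mm256_loadu_pd(py);

    /* Low halves of x and y: {x0, x1, y0, y1} */
    __m256d lo = _mm256_permute2f128_pd(x, y, 0x20);
    /* High halves of x and y: {x2, x3, y2, y3} */
    __m256d hi = _mm256_permute2f128_pd(x, y, 0x31);

    /* Within each 128-bit half, take element 0 of lo then element 0 of hi. */
    _mm256_storeu_pd(even, _mm256_unpacklo_pd(lo, hi));
    /* Same, but element 1 of each. */
    _mm256_storeu_pd(odd, _mm256_unpackhi_pd(lo, hi));
}
```

Feeding the two outputs to one more VHADDPD would give the sums the question asks for, at the cost of more shuffles than the blend-and-add route.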