I’m writing a compiler that’s generating LLVM IR instructions. I’m working extensively with vectors.
I would like to be able to sum all the elements in a vector. Right now I’m just extracting each element individually and adding them up manually, but it strikes me that this is precisely the sort of thing that the hardware should be able to help with (as it sounds like a pretty common operation). But there doesn’t seem to be an intrinsic to do it.
What’s the best way to do this? I’m using LLVM 3.2.
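First of all, even without using intrinsics, you can generate log2(n) vector additions (with n being the vector length) instead of n - 1 scalar additions. For a vector of size 8, for example, you split the vector into two halves and add them, then repeat on the result until a single element is left.

If your target supports these vector additions, then it seems highly likely that this sequence will be lowered to those instructions, which should be faster than extracting and adding every element individually.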
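As a rough sketch of the halving approach for an `<8 x i32>` vector, the IR below uses `shufflevector` to pull out the low and high halves at each step. The function name `@sum_v8i32` and the value names are made up for illustration, and the last step is done on scalars because only two elements remain:

```llvm
; Sum the 8 lanes of %a with two vector adds and one scalar add.
define i32 @sum_v8i32(<8 x i32> %a) {
  ; 8 -> 4: add the low half to the high half
  %lo4 = shufflevector <8 x i32> %a, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %hi4 = shufflevector <8 x i32> %a, <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %sum4 = add <4 x i32> %lo4, %hi4

  ; 4 -> 2: same trick on the partial sums
  %lo2 = shufflevector <4 x i32> %sum4, <4 x i32> undef, <2 x i32> <i32 0, i32 1>
  %hi2 = shufflevector <4 x i32> %sum4, <4 x i32> undef, <2 x i32> <i32 2, i32 3>
  %sum2 = add <2 x i32> %lo2, %hi2

  ; 2 -> 1: extract the two remaining lanes and add them as scalars
  %e0 = extractelement <2 x i32> %sum2, i32 0
  %e1 = extractelement <2 x i32> %sum2, i32 1
  %sum = add i32 %e0, %e1
  ret i32 %sum
}
```

The same pattern extends to other power-of-two lengths by adding or removing halving steps. Whether the narrower `<4 x i32>` and `<2 x i32>` additions map to real vector instructions depends on the target, so it is worth checking the generated assembly.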
Regarding intrinsics, LLVM 3.2 has no target-independent intrinsics to handle this. If you’re compiling to x86, though, you do have access to the `hadd` intrinsics (e.g. `llvm.x86.ssse3.phadd.d.128`, which adds adjacent pairs of elements from two `<4 x i32>` vectors). You’ll still have to do something similar to the above, only the `add` instructions could be replaced.

For more information about this you can search for “horizontal sum” or “horizontal vector sum”; there are several relevant Stack Overflow questions about horizontal sums on x86.
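To make the `hadd` variant concrete, here is a minimal sketch for a `<4 x i32>` vector. It assumes an x86 target with SSSE3 enabled and that your LLVM version declares `llvm.x86.ssse3.phadd.d.128` with the signature shown; the function name `@hsum_v4i32` is hypothetical:

```llvm
; Assumed declaration of the SSSE3 horizontal-add intrinsic (phaddd).
declare <4 x i32> @llvm.x86.ssse3.phadd.d.128(<4 x i32>, <4 x i32>)

define i32 @hsum_v4i32(<4 x i32> %a) {
  ; phadd(a, a) = <a0+a1, a2+a3, a0+a1, a2+a3>
  %h1 = call <4 x i32> @llvm.x86.ssse3.phadd.d.128(<4 x i32> %a, <4 x i32> %a)
  ; applying it again leaves the full sum in every lane
  %h2 = call <4 x i32> @llvm.x86.ssse3.phadd.d.128(<4 x i32> %h1, <4 x i32> %h1)
  %sum = extractelement <4 x i32> %h2, i32 0
  ret i32 %sum
}
```

This still takes log2(n) steps, just like the shuffle-and-add version. Horizontal adds are not necessarily faster than a shuffle followed by a plain `add`, so it is worth measuring both on your target.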