In a program I wrote, 20% of the time is being spent on finding out the minimum of 3 numbers in an inner loop, in this routine:
static inline unsigned int
min(unsigned int a, unsigned int b, unsigned int c)
{
unsigned int m = a;
if (m > b) m = b;
if (m > c) m = c;
return m;
}
Is there any way to speed this up? I am ok with assembly code too for x86/x86_64.
Edit: In reply to some of the comments:
* Compiler being used is gcc 4.3.3
* As far as assembly is concerned, I am only a beginner there. I asked for assembly here, to learn how to do this. 🙂
* I have a quad-core Intel 64 running, so MMX/SSE etc. are supported.
* It’s hard to post the loop here, but I can tell you it’s a heavily optimized implementation of the levenshtein algorithm.
This is what the compiler is giving me for the non-inlined version of min:
.globl min
.type min, @function
min:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %edx
movl 12(%ebp), %eax
movl 16(%ebp), %ecx
cmpl %edx, %eax
jbe .L2
movl %edx, %eax
.L2:
cmpl %ecx, %eax
jbe .L3
movl %ecx, %eax
.L3:
popl %ebp
ret
.size min, .-min
.ident "GCC: (Ubuntu 4.3.3-5ubuntu4) 4.3.3"
.section .note.GNU-stack,"",@progbits
The inlined version is within -O2 optimized code (even my markers mrk = 0xfefefefe, before and after the call to min()) are getting optimized away by gcc, so I couldn’t get hold of it.
Update: I tested the changes suggested by Nils, ephemient, however there’s no perceptible performance boost I get by using the assembly versions of min(). However, I get a 12.5% boost by compiling the program with -march=i686, which I guess is because the whole program is getting the benefits of the new faster instructions that gcc is generating with this option. Thanks for your help guys.
P.S. – I used the ruby profiler to measure performance (my C program is a shared library loaded by a ruby program), so I could get time spent only for the top-level C function called by the ruby program, which ends up calling min() down the stack. Please see this question.
Make sure you are using an appropriate
-marchsetting, first off. GCC defaults to not using any instructions that were not supported on the original i386 – allowing it to use newer instruction sets can make a BIG difference at times! On-march=core2 -O2I get:The use of cmov here may help you avoid branch delays – and you get it without any inline asm just by passing in
-march. When inlined into a larger function this is likely to be even more efficient, possibly just four assembly operations. If you need something faster than this, see if you can get the SSE vector operations to work in the context of your overall algorithm.