I am writing a Linux kernel driver (for ARM), and in an IRQ handler I need to check the interrupt bits.
bit      meaning
0/16     Endpoint 0 In/Out interrupt (very likely; In is the more likely of the two)
1/17     Endpoint 1 In/Out interrupt
...
15/31    Endpoint 15 In/Out interrupt
Note that more than one bit can be set at a time.
So this is the code:
    int i;
    u32 intr = read_interrupt_register();

    /* ep0 IN */
    if (likely(intr & (1 << 0))) {
        handle_ep0_in();
    }

    /* ep0 OUT */
    if (likely(intr & (1 << 16))) {
        handle_ep0_out();
    }

    for (i = 1; i < 16; ++i) {
        if (unlikely(intr & (1 << i))) {
            handle_ep_in(i);
        }
        if (unlikely(intr & (1 << (i + 16)))) {
            handle_ep_out(i);
        }
    }
(1 << 0) and (1 << 16) would be calculated at compile time, but (1 << i) and (1 << (i + 16)) wouldn’t. There would also be an integer comparison and an addition on every loop iteration.
Because this is an IRQ handler, the work should be done in the shortest possible time. This makes me wonder whether I need to optimize it a bit.
Possible ways?
1. Split the loop, seems to make no difference…
    /* ep0 IN */
    if (likely(intr & (1 << 0))) {
        handle_ep0_in();
    }

    /* ep0 OUT */
    if (likely(intr & (1 << 16))) {
        handle_ep0_out();
    }

    for (i = 1; i < 16; ++i) {
        if (unlikely(intr & (1 << i))) {
            handle_ep_in(i);
        }
    }

    for (i = 17; i < 32; ++i) {
        if (unlikely(intr & (1 << i))) {
            handle_ep_out(i - 16);
        }
    }
2. Shift intr instead of the value to be compared to?
    /* ep0 IN */
    if (likely(intr & (1 << 0))) {
        handle_ep0_in();
    }

    /* ep0 OUT */
    if (likely(intr & (1 << 16))) {
        handle_ep0_out();
    }

    for (i = 1; i < 16; ++i) {
        intr >>= 1;
        if (unlikely(intr & 1)) {
            handle_ep_in(i);
        }
    }

    intr >>= 1;    /* skip bit 16 (ep0 OUT), already handled above */

    for (i = 1; i < 16; ++i) {
        intr >>= 1;
        if (unlikely(intr & 1)) {
            handle_ep_out(i);
        }
    }
3. Fully unroll the loop (not shown). That would make the code a bit messy.
4. Any other better ways?
5. Or will the compiler already generate the most optimized code on its own?
Edit: I was looking for a way to tell the gcc compiler to unroll that particular loop, but it seems that it isn’t possible according to my search…
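If we can assume that the number of set bits in intr is low (as is usually the case with interrupt masks), we can optimize a little and write a loop that executes only once per set bit: find the lowest set bit (for example with __builtin_ffs), handle it, clear it, and repeat until intr is zero.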
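A minimal sketch of such a loop, assuming GCC's `__builtin_ffs` and the bit layout from the question (bits 0..15 are IN, bits 16..31 are OUT). The function name `dispatch_ep_interrupts` is hypothetical, and the two handlers are stand-ins that only record which endpoints were serviced; the real ones would come from the driver.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the driver's handlers: they only record
 * which endpoints were serviced, so the sketch compiles on its own. */
static uint32_t in_seen, out_seen;

static void handle_ep_in(int ep)  { in_seen  |= 1u << ep; }
static void handle_ep_out(int ep) { out_seen |= 1u << ep; }

/* Runs one iteration per set bit instead of one per bit position. */
static void dispatch_ep_interrupts(uint32_t intr)
{
    while (intr) {
        /* __builtin_ffs returns 1 + the index of the lowest set bit */
        int bit = __builtin_ffs((int)intr) - 1;

        if (bit < 16)
            handle_ep_in(bit);        /* bits 0..15: IN endpoints */
        else
            handle_ep_out(bit - 16);  /* bits 16..31: OUT endpoints */

        intr &= intr - 1;             /* clear the lowest set bit */
    }
}
```

Calling `dispatch_ep_interrupts(read_interrupt_register())` from the handler would service the pending endpoints from the lowest bit upward. Inside the kernel itself, the kernel's own helpers such as `__ffs()` or `for_each_set_bit()` would probably be the more idiomatic choice than the raw builtin.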
On most ARM architectures __builtin_ffs will compile down to a CLZ instruction and some arithmetic around it. It should do so for anything but ARM7 and older cores.
Also: when writing interrupt handlers on embedded devices, the size of the function affects performance as well, because its instructions have to be loaded into the instruction cache. Lean code usually executes faster. A bit of extra computation is fine if it saves accesses to memory that is unlikely to be in the cache.