I have written a function that reads an input buffer of bytes and produces an output buffer of words, where each word is 0x0081 for each ON bit of the input buffer or 0x007F for each OFF bit. The length of the input buffer is given. Both arrays have enough space allocated. I also have about 2 KB of free RAM that I can use for lookup tables or similar.
I have now found that this function is the bottleneck in a real-time application. It is called very frequently. Can you please suggest a way to optimize it? One possibility I see is to use only one buffer and do in-place substitution.
    void inline BitsToWords(int8  *pc_BufIn,
                            int16 *pw_BufOut,
                            int32 BufInLen)
    {
        int32 i, j, z = 0;

        for (i = 0; i < BufInLen; i++)
        {
            for (j = 0; j < 8; j++, z++)
            {
                pw_BufOut[z] =
                    (((pc_BufIn[i] >> (7 - j)) & 0x01) == 1 ?
                     0x0081 : 0x007f);
            }
        }
    }
Please do not offer any library-specific, compiler-specific, or CPU/hardware-specific optimizations, because this is a multi-platform project.
Your lookup tables can be placed in a const array at compile time, so they could live in ROM. Does that give you room for the straightforward 4 KB table (256 possible input bytes × 8 output words × 2 bytes each)?

If you can afford 4 KB of ROM space, the only problem is building the table as an initialized array in a .c file. That only has to be done once, and you can write a script to do it (which may help ensure it’s correct and may also help if you decide that the table needs to change for some reason in the future).

You’d have to profile to ensure that the copy from ROM to the destination array is actually faster than calculating what needs to go into the destination. I wouldn’t be surprised if computing each output word directly, without any table, ends up being faster. I’d expect that an optimized build of such a function would have everything in registers or encoded into the instructions except for a single read of each input byte and a single write of each output word. Or pretty close to that.
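A computed version of the kind described above might look roughly like the following. This is only a sketch, not the code from the project: the name `BitsToWordsUnrolled` is made up for illustration, and it assumes `<stdint.h>` types are available in place of the project's own `int8`/`int16`/`int32` typedefs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: expand each input bit (most significant bit first)
 * into one output word, 0x0081 for a set bit and 0x007F for a clear bit.
 * Assumes out has room for 8 * in_len words. */
void BitsToWordsUnrolled(const uint8_t *in, uint16_t *out, size_t in_len)
{
    size_t i;

    for (i = 0; i < in_len; ++i) {
        unsigned int b = in[i];  /* the only read of this input byte */

        /* Inner loop unrolled by hand; every mask is a constant. */
        out[0] = (b & 0x80u) ? 0x0081u : 0x007Fu;
        out[1] = (b & 0x40u) ? 0x0081u : 0x007Fu;
        out[2] = (b & 0x20u) ? 0x0081u : 0x007Fu;
        out[3] = (b & 0x10u) ? 0x0081u : 0x007Fu;
        out[4] = (b & 0x08u) ? 0x0081u : 0x007Fu;
        out[5] = (b & 0x04u) ? 0x0081u : 0x007Fu;
        out[6] = (b & 0x02u) ? 0x0081u : 0x007Fu;
        out[7] = (b & 0x01u) ? 0x0081u : 0x007Fu;

        out += 8;  /* eight words written per input byte */
    }
}
```

Compared with the original, this drops the variable shift and the second loop counter. Whether it actually beats a table copy on a given target is something only profiling can tell.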
You might be able to further optimize by acting on more than one input byte at a time, but then you have to deal with alignment issues and how to handle input buffers that aren’t a multiple of the chunk size you’re dealing with. Those aren’t problems that are too hard to deal with, but they do complicate things, and it’s unclear what kind of improvement you might be able to expect.
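Returning to the table option: the script for generating the initialized array could itself be a small host-side C program. The sketch below is one way to do it, under assumptions: the names `FillBitsToWordsTable`, `PrintBitsToWordsTable`, and `BitsToWordsTable` are hypothetical, and `<stdint.h>` is assumed to be available on the machine that runs the generator.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: table[b][k] receives the output word for bit k
 * (most significant bit first) of the byte value b. */
void FillBitsToWordsTable(uint16_t table[256][8])
{
    unsigned int b, k;

    for (b = 0; b < 256; ++b) {
        for (k = 0; k < 8; ++k) {
            table[b][k] = ((b >> (7u - k)) & 1u) ? 0x0081u : 0x007Fu;
        }
    }
}

/* Hypothetical generator: writes the table to fp as C source text that
 * can be saved as a .c file and compiled into the target build. */
void PrintBitsToWordsTable(FILE *fp)
{
    static uint16_t table[256][8];
    unsigned int b, k;

    FillBitsToWordsTable(table);

    fprintf(fp, "const unsigned short BitsToWordsTable[256][8] = {\n");
    for (b = 0; b < 256; ++b) {
        fprintf(fp, "    {");
        for (k = 0; k < 8; ++k) {
            fprintf(fp, "0x%04X%s", (unsigned int)table[b][k],
                    k < 7 ? ", " : "");
        }
        fprintf(fp, "}%s\n", b < 255 ? "," : "");
    }
    fprintf(fp, "};\n");
}
```

Calling `PrintBitsToWordsTable(stdout)` from a host-side `main` and redirecting the output to a file would produce the array. On the target, each input byte then selects one 8-word row to copy into the output buffer, which is the copy that needs to be profiled against the computed version.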