I am looking for the fastest way to interleave and de-interleave a buffer.
To be more specific, I am dealing with audio data, so I am trying to optimize the time I spend on splitting/combining channels and FFT buffers.
Currently I use a for loop with one index variable per array (two in total), so the only arithmetic is addition, but with all the managed array bounds checks it cannot compete with a C-style pointer approach.
I like the Buffer.BlockCopy and Array.Copy methods, which cut a lot of time when I process channels, but there is no way for an array to have a custom indexer.
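To illustrate why the block-copy methods help only with contiguous data, here is a minimal sketch. It assumes a planar layout (all samples of channel 0, then all of channel 1, and so on), which is an assumption made for this example and not the interleaved layout discussed in the rest of the post. The names `PlanarCopy` and `CopyChannel` are hypothetical.

```csharp
using System;

public static class PlanarCopy
{
    // Hypothetical helper: copies one channel out of a PLANAR buffer.
    // It does not work on interleaved data, because the samples of one
    // channel are then not contiguous in memory.
    public static float[] CopyChannel(float[] planar, int channels, int channel)
    {
        int length = planar.Length / channels;
        float[] result = new float[length];

        // Buffer.BlockCopy takes its offsets and count in BYTES, not elements.
        Buffer.BlockCopy(planar, channel * length * sizeof(float),
                         result, 0, length * sizeof(float));
        return result;
    }
}
```

With interleaved samples, each channel is spread across the buffer with a stride of `channels`, so a single contiguous copy like this cannot extract it.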
I tried to build an array mask, a fake array with a custom indexer, but it turned out to be about twice as slow when used in my FFT operation. I guess the compiler can apply many optimizations when an array is accessed directly, and cannot apply them when the access goes through a class indexer.
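For readers unfamiliar with the idea, an array mask of this kind could look roughly like the sketch below. It is not the class that was actually benchmarked; the name `StridedView` and its members are hypothetical and only illustrate the extra multiplication and indirection on every access.

```csharp
using System;

// Hypothetical "array mask": exposes every stride-th element of a backing
// array, starting at a given offset, without copying any data.
public sealed class StridedView
{
    private readonly float[] _data;
    private readonly int _start;
    private readonly int _stride;

    public StridedView(float[] data, int start, int stride)
    {
        if (data == null) throw new ArgumentNullException(nameof(data));
        if (stride <= 0) throw new ArgumentOutOfRangeException(nameof(stride));
        if (start < 0 || start > data.Length) throw new ArgumentOutOfRangeException(nameof(start));

        _data = data;
        _start = start;
        _stride = stride;
        // Number of elements reachable from start in steps of stride.
        Length = (data.Length - start + stride - 1) / stride;
    }

    public int Length { get; }

    // Every access pays for a multiply, an add, and the backing array's
    // own bounds check.
    public float this[int i]
    {
        get { return _data[_start + i * _stride]; }
        set { _data[_start + i * _stride] = value; }
    }
}
```

For example, `new StridedView(buffer, 1, 2)` would present the second channel of a stereo interleaved buffer as if it were its own array.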
I do not want an unsafe solution, although from the looks of it, that might be the only way to optimize this type of operation.
Thanks.
Here is the type of thing I’m doing right now:
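    private float[][] DeInterleave(float[] buffer, int channels)
    {
        // One output array per channel; trailing samples that do not
        // fill a complete frame are ignored by the integer division.
        float[][] tempbuf = new float[channels][];
        int length = buffer.Length / channels;
        for (int c = 0; c < channels; c++)
        {
            tempbuf[c] = new float[length];
            // offset starts at the channel index and advances one frame per sample.
            for (int i = 0, offset = c; i < tempbuf[c].Length; i++, offset += channels)
                tempbuf[c][i] = buffer[offset];
        }
        return tempbuf;
    }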
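The combining direction mirrors the same loop. The sketch below is a hypothetical counterpart to `DeInterleave`, not code from the tests; the names `ChannelOps` and `Interleave` are made up for this example, and it assumes every channel has the same length.

```csharp
using System;

public static class ChannelOps
{
    // Hypothetical inverse of DeInterleave: merges equally sized channel
    // arrays into one interleaved buffer (frame by frame).
    public static float[] Interleave(float[][] channels)
    {
        if (channels == null || channels.Length == 0)
            return new float[0];

        int count = channels.Length;
        int length = channels[0].Length;
        float[] buffer = new float[length * count];

        for (int c = 0; c < count; c++)
        {
            float[] source = channels[c];
            if (source.Length != length)
                throw new ArgumentException("All channels must have the same length.");

            // offset starts at the channel index and advances one frame per sample.
            for (int i = 0, offset = c; i < source.Length; i++, offset += count)
                buffer[offset] = source[i];
        }
        return buffer;
    }
}
```

Calling `ChannelOps.Interleave` on the result of `DeInterleave` should reproduce the original buffer whenever its length is a multiple of the channel count.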
I ran some tests and here is the code I tested:
And the results (buffer size = 2^17, 200 timed iterations per test):
I got similar results in every test. Obviously the unsafe code is the fastest, but I was surprised to see that the CLR couldn’t figure out that it can drop the index checks when dealing with a jagged array. Maybe someone can think of more ways to optimize my tests.
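One commonly suggested way to help with the jagged-array case is to copy the inner array into a local variable, so that the loop condition and the indexed access refer to the same local array. The sketch below is an illustration under that assumption, not one of the tested variants; `DeInterleaver` and `DeInterleaveHoisted` are hypothetical names, and whether the JIT actually removes the check on `row[i]` depends on the runtime version, so it needs to be measured.

```csharp
using System;

public static class DeInterleaver
{
    // Sketch: same algorithm as DeInterleave, but each inner array is held
    // in a local so the JIT has a chance to prove that i < row.Length
    // makes the bounds check on row[i] redundant.
    public static float[][] DeInterleaveHoisted(float[] buffer, int channels)
    {
        float[][] result = new float[channels][];
        int length = buffer.Length / channels;

        for (int c = 0; c < channels; c++)
        {
            float[] row = new float[length];

            // The check on buffer[offset] remains: nothing here tells the
            // JIT that offset stays below buffer.Length.
            for (int i = 0, offset = c; i < row.Length; i++, offset += channels)
                row[i] = buffer[offset];

            result[c] = row;
        }
        return result;
    }
}
```

This leaves the strided read from the input buffer as the only access that is still bounds-checked on every iteration.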
Edit:
I tried loop unrolling with the unsafe code and it didn’t have an effect.
I also tried optimizing the loop unrolling method:
The results are hit-or-miss compared with test #4, with about a 1% difference either way. Test #4 is my best option so far.
As I told jerryjvl, the problem is getting the CLR to skip the index check on the input buffer, since adding a second condition (&& offset < inout.Length) to the loop slows it down…
Edit 2:
The earlier tests were run inside the IDE, so here are the results from running outside it:
It looks like loop unrolling does not help. My optimized code is still my best option, and it is only about 10% slower than the unsafe code. If only I could tell the compiler that (i < buf.Length) implies (offset < inout.Length), it would drop the check on inout[offset] and I would basically get the unsafe performance.