As part learning exercise, part hobby project, I am implementing my own interpretation of the Cooley-Tukey FFT algorithm on an AVR using fixed point math. I haven’t dealt much with fixed point math before, and am wondering how best to go about part of the implementation. I guess the gist of this question is a request to confirm that I’m thinking about the issues involved correctly.
The heart of the C-T algorithm involves a set of multiplications and additions on complex-valued data in the following fashion (in pseudocode):
temp1 = cosine(increment)*dataRealPart[increment1] - sine(increment)*dataImaginaryPart[increment1]
temp2 = cosine(increment)*dataImaginaryPart[increment1] + sine(increment)*dataRealPart[increment1]
dataRealPart[increment1] = dataRealPart[increment2] - temp1
etc.
The cosine and sine data will be 8 bit signed binary fractions of the form S.XXX’XXXX, the input data will also be 8 bit signed binary fractions of the form SXXX.XXXX, and the multiplications will generate a 16 bit signed fraction product. As I see it, for particularly “bad” values of sine and cosine and the real and imaginary part of the data, temp1 or temp2 will come pretty close to the limits of a 16 bit signed integer.
If both the real part and imaginary part of the data are, say, b0111.1111, a little work in Wolfram Alpha shows that, with “bad” values of sine and cosine, the output can be up to 1.4 times larger than what simply multiplying the maximum value of a sine times the maximum value of the input would be.
For example, if the sine argument is b0111.1111 and the input value is b0111.1111, the output would be b0111111.00000001, or 16129 in decimal. 1.4 times that would be about 22580. This won’t overflow the positive range of a signed 16-bit int, but in the next lines these products are added to and subtracted from the input data, and assuming the input data here is converted to 16 bits, it’s likely that an overflow will occur.
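To make that worst case concrete, here is a minimal C sketch of the arithmetic. The function names (mul_q7_q4, rotate_real, butterfly_sum, fits_int16) are made up for illustration, and it assumes the twiddle is S.XXXXXXX (7 fractional bits) and the data is SXXX.XXXX (4 fractional bits), so each product carries 11 fractional bits. The sums are held in 32 bits purely so the overflow can be observed; that is not a suggestion for the AVR code itself.

```c
#include <stdint.h>

/* Assumed formats: twiddle has 7 fractional bits, sample has 4,
 * so the 16-bit product has 11 fractional bits. */
static int16_t mul_q7_q4(int8_t twiddle, int8_t sample)
{
    /* 8-bit x 8-bit always fits in 16 bits (largest magnitude is 128*128). */
    return (int16_t)((int16_t)twiddle * sample);
}

/* temp1 = cos*re - sin*im, widened to 32 bits so the true value is visible. */
static int32_t rotate_real(int8_t c, int8_t s, int8_t re, int8_t im)
{
    return (int32_t)mul_q7_q4(c, re) - (int32_t)mul_q7_q4(s, im);
}

/* The other butterfly input, aligned to 11 fractional bits (x128),
 * combined with the rotation term. */
static int32_t butterfly_sum(int8_t other, int32_t temp)
{
    return (int32_t)other * 128 + temp;
}

/* Nonzero when v is representable as a signed 16-bit integer. */
static int fits_int16(int32_t v)
{
    return v >= INT16_MIN && v <= INT16_MAX;
}
```

With c = 90 and s = -90 (roughly +/-0.707 in this format) and re = im = 127, rotate_real gives 22860, which still fits in 16 bits. Combining that with a full-scale sample aligned to the same scale (127*128 = 16256) gives 39116, which does not.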
It looks to me like the trade-off is: either increase the internal processing resolution of the data, which increases the processing time, or make sure the input data stays below the amplitude that causes overflow, which decreases the signal-to-noise ratio. Is that about the size of things?
One option is to reduce your sine and cosine values to Q6 (6 bits to the right of the binary point), which limits them to +/-64. Notice that with one more bit of precision (Q7), -1 is representable but +1 is not, since it would be +128. Also, after you multiply by a Q6 value there will be 2 sign bits in the result, which means a sum of 2 products can probably be added with no problem. With this extra bit gained from the reduced resolution, you should be able to avoid overflow.
Another point: if your complex value is limited to a magnitude of 1 (real*real + imag*imag <= 1), then the resulting sum will not exceed 1, as opposed to the 1.4 figure you found. That’s because when sine is 1, cosine is zero; you’re essentially taking a dot product of a unit vector (cos, sin) and the complex vector.
Beyond that, you can shift your 16-bit products right a few bits prior to adding them, probably with rounding. You don’t really have 16 bits of precision anyway, since both the data and the trig functions were rounded to 7 bits.
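As a rough sketch of the Q6 idea and the shift-with-rounding idea in C: the names below (rotate_real_q6, butterfly_real_q6, shift_round) are hypothetical, and the code assumes twiddles are stored in the range -64..+64 and data is an 8-bit value with 4 fractional bits, so products carry 10 fractional bits.

```c
#include <stdint.h>

/* cos*re - sin*im with Q6 twiddles (assumed range -64..+64).
 * Each product has magnitude at most 64*128 = 8192, so the
 * difference of two products is at most 16384 and fits in 16 bits. */
static int16_t rotate_real_q6(int8_t c6, int8_t s6, int8_t re, int8_t im)
{
    int16_t p1 = (int16_t)((int16_t)c6 * re);
    int16_t p2 = (int16_t)((int16_t)s6 * im);
    return (int16_t)(p1 - p2);
}

/* other - temp1, with 'other' aligned to 10 fractional bits (x64).
 * Worst case magnitude is 8192 + 16384 = 24576, still inside int16_t. */
static int16_t butterfly_real_q6(int8_t other, int8_t c6, int8_t s6,
                                 int8_t re, int8_t im)
{
    return (int16_t)((int16_t)other * 64 - rotate_real_q6(c6, s6, re, im));
}

/* Right shift by n (n >= 1) with round-to-nearest. Right-shifting a
 * negative value is implementation-defined in C; this assumes an
 * arithmetic shift, which is what GCC does. */
static int16_t shift_round(int16_t x, uint8_t n)
{
    return (int16_t)((x + (1 << (n - 1))) >> n);
}
```

For the same "bad" inputs as in the question (cos and sin of about +/-0.707, which is +/-45 in Q6, with re = im = 127), rotate_real_q6 returns 11430, and even against a full-scale negative sample the butterfly result is -19622, comfortably inside the 16-bit range.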
One last point. If you’re taking a sum of many numbers and you know the result will always be within your representable range, you do not need to worry about overflows of the intermediate results, provided the arithmetic wraps around (ordinary two’s complement) rather than saturates. It will all just work out in the end (the bits that you throw away all sum to zero anyway).
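A small C sketch of that wrap-around property, with made-up names (wrapping_sum, demo_wrapping_sum). Signed overflow is undefined behaviour in C, so the accumulation is done in uint16_t, where wrap-around is well defined, and converted back at the end. The final conversion to int16_t is implementation-defined for values above 32767; the sketch assumes the usual two's complement behaviour, which is what GCC provides.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum n signed 16-bit values modulo 2^16. Intermediate results may
 * wrap, but if the true total fits in int16_t the answer is exact. */
static int16_t wrapping_sum(const int16_t *v, size_t n)
{
    uint16_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc = (uint16_t)(acc + (uint16_t)v[i]);
    }
    return (int16_t)acc; /* assumes two's complement conversion */
}

/* Example: the running total passes through 60000 and 30000 - 25000
 * style wraps, but the true total (5000) is in range. */
static int16_t demo_wrapping_sum(void)
{
    static const int16_t values[] = { 30000, 30000, -30000, -25000 };
    return wrapping_sum(values, sizeof values / sizeof values[0]);
}
```

Here the first partial sum (60000) is outside the signed 16-bit range, yet demo_wrapping_sum still returns the correct total of 5000.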