Here’s the code:
#include <stdio.h>
#include <math.h>
static double const x = 665857;
static double const y = 470832;
int main(){
    double z = x*x*x*x -(y*y*y*y*4+y*y*4);
    printf("%f \n",z);
    return 0;
}
Mysteriously (to me), this code prints “0.0” if compiled on 32-bit machines (or with the -m32 flag on 64-bit machines, as in my case) with GCC 4.6. As far as I know about floating point operations, it is possible to overflow or underflow them, or to lose precision with them, but… a 0? How?
Thanks in advance.
This is a result of the way IEEE 754 represents floating point numbers in normalized form. A float, a double, or any other IEEE 754 compliant value is stored as a sign bit, an exponent, and a mantissa of the form 1.xxxxxxxxxxxxxxxxxxx,
where
xxxxxxxxxxxxxxxxxxx is the fractional part of the mantissa, so the mantissa itself is always in the range [1, 2). The integer part, which is always 1, is not stored in the representation. The number of x bits defines the precision. It is 52 bits for the double. The exponent is stored in an offset form (one must subtract 1023 in order to obtain its value), but that is irrelevant now.
665857^4 in 64-bit IEEE 754 is:
(the first bit is the sign bit: 0 = positive, 1 = negative; the bit in parentheses is not really stored)
In 80-bit x86 extended precision it is:
(here the integer part is explicitly part of the representation – a deviation from IEEE 754; I’ve aligned the mantissas for clarity)
4*470832^4 in 64-bit IEEE 754 and 80-bit x86 extended precision is:
4*470832^2 in 64-bit IEEE 754 and 80-bit x86 extended precision is:
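The 64-bit bit patterns of these three quantities can be printed with a short helper. This is only a sketch: `decode_double` and `print_double_fields` are names made up for this illustration, and the code assumes that `double` is the 64-bit IEEE 754 format. It does not cover the 80-bit x86 extended format.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* The three stored fields of a 64-bit IEEE 754 double. */
struct double_fields {
    unsigned sign;      /* 0 = positive, 1 = negative */
    int exponent;       /* unbiased: stored value minus 1023 */
    uint64_t fraction;  /* the 52 stored mantissa bits (leading 1 is implicit) */
};

/* Split a double into its fields (assumption: double is 64-bit IEEE 754). */
struct double_fields decode_double(double d) {
    uint64_t bits;
    struct double_fields f;
    assert(sizeof bits == sizeof d);
    memcpy(&bits, &d, sizeof bits);  /* copy the raw bytes, no conversion */
    f.sign = (unsigned)(bits >> 63);
    f.exponent = (int)((bits >> 52) & 0x7FF) - 1023;
    f.fraction = bits & ((UINT64_C(1) << 52) - 1);
    return f;
}

/* Print one value as: sign, power of two, (1).fraction bits */
void print_double_fields(const char *label, double d) {
    struct double_fields f = decode_double(d);
    int i;
    printf("%-10s %u 2^%-3d (1).", label, f.sign, f.exponent);
    for (i = 51; i >= 0; --i)
        putchar(((f.fraction >> i) & 1) ? '1' : '0');
    putchar('\n');
}
```

Calling `print_double_fields("x^4", x*x*x*x)`, `print_double_fields("4*y^4", y*y*y*y*4)` and `print_double_fields("4*y^2", y*y*4)` from `main` shows the sign, the exponent and the 52 stored bits of each quantity, with the implicit leading 1 in parentheses.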
When you sum up the last two numbers, the procedure is the following: the smaller value has its exponent adjusted to match the larger value’s exponent while the mantissa is shifted to the right in order to preserve the value. Since the two exponents differ by 38, the mantissa of the smaller number is shifted 38 bits to the right:
470832^2*4 in adjusted 64-bit IEEE 754 and 80-bit x86 extended precision:
Now both quantities have the same exponent and their mantissas can be summed:
I kept some of the 80-bit precision bits on the right of the bar, because the summation internally is done in the greater precision of 80 bits.
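The alignment step can be checked numerically. The sketch below assumes 64-bit IEEE 754 doubles and uses hypothetical helper names (`exponent_of`, `alignment_shift`, `bits_below_ulp`); it reports how far the smaller mantissa is shifted and which part of an integer-valued operand ends up below the last mantissa bit of the larger one.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Unbiased exponent of a positive, normal double, read from its bit pattern. */
int exponent_of(double d) {
    uint64_t bits;
    assert(sizeof bits == sizeof d);  /* assumes 64-bit IEEE 754 double */
    memcpy(&bits, &d, sizeof bits);
    return (int)((bits >> 52) & 0x7FF) - 1023;
}

/* Number of positions the smaller operand's mantissa is shifted to the right. */
int alignment_shift(double larger, double smaller) {
    return exponent_of(larger) - exponent_of(smaller);
}

/* Part of an integer-valued operand that lies below the last of the 52
   fraction bits of a double whose exponent is larger_exponent.
   Assumes 52 <= larger_exponent < 116 so the shift count stays in range. */
uint64_t bits_below_ulp(uint64_t small_value, int larger_exponent) {
    int ulp_log2 = larger_exponent - 52;  /* weight of the last fraction bit */
    return small_value & ((UINT64_C(1) << ulp_log2) - 1);
}
```

With `y = 470832`, `alignment_shift(y*y*y*y*4, y*y*4)` gives the 38 positions mentioned above. `bits_below_ulp` shows how much of 4*470832^2 cannot be kept in a 52-bit fraction once the exponent is 77 and has to be handled by rounding.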
Now let’s perform the subtraction in 64-bit plus some bits of the 80-bit representation:
A pure 0! If you perform the calculations in full 80-bit, you would once again obtain a pure 0.
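For comparison, the mathematically exact value of the expression is 1, because 665857^2 = 2*470832^2 + 1. A minimal sketch that confirms this with integer arithmetic follows. The function names are invented for this example, and since x^4 (about 2^77) does not fit in 64 bits, the expression is first rewritten algebraically as (x^2 - 2y^2)(x^2 + 2y^2) - 4y^2.

```c
#include <stdint.h>

/* Exact value of x^4 - (4*y^4 + 4*y^2) using only 64-bit integers.
   Uses x^4 - 4*y^4 = (x^2 - 2*y^2) * (x^2 + 2*y^2) so that no
   intermediate result overflows for the values in the question. */
int64_t exact_result(int64_t x, int64_t y) {
    int64_t x2 = x * x;
    int64_t y2 = y * y;
    return (x2 - 2 * y2) * (x2 + 2 * y2) - 4 * y2;
}

/* The expression exactly as written in the question, in floating point.
   Its value depends on the precision the compiler uses for intermediates. */
double floating_result(double x, double y) {
    return x*x*x*x - (y*y*y*y*4 + y*y*4);
}
```

`exact_result(665857, 470832)` returns 1, while `floating_result` can only return what survives the rounding of the two large intermediate values.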
The real problem here is that 1.0 cannot be represented in 64-bit precision when the exponent is 77: that would take 77 bits of precision in the mantissa, and a double has only 52. The same is true for 80-bit precision: there are only 63 bits in the mantissa, 14 bits fewer than necessary to represent 1.0 given an exponent of 77.
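The gap between neighbouring doubles at this magnitude can be measured directly. With 52 fraction bits and an exponent of 77, the last bit is worth 2^(77-52) = 2^25 = 33554432, so adding 1.0 changes nothing. The sketch below assumes 64-bit IEEE 754 doubles. `next_up`, `ulp_at` and `one_is_absorbed` are illustrative names, and `next_up` relies on the fact that, for a positive finite double, adding 1 to its bit pattern gives the next representable value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Next representable double above a positive, finite d. */
double next_up(double d) {
    uint64_t bits;
    assert(sizeof bits == sizeof d);  /* assumes 64-bit IEEE 754 double */
    memcpy(&bits, &d, sizeof bits);
    bits += 1;                        /* step to the adjacent bit pattern */
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Gap between d and the next double above it (one unit in the last place). */
double ulp_at(double d) {
    return next_up(d) - d;
}

/* Returns 1 when d + 1.0 rounds back to d in 64-bit double precision. */
int one_is_absorbed(double d) {
    volatile double sum = d + 1.0;    /* volatile: force a stored double */
    return sum == d;
}
```

For `x4 = x*x*x*x` with `x = 665857`, `ulp_at(x4)` is 33554432.0 and `one_is_absorbed(x4)` is 1. In the 80-bit format the same reasoning gives a gap of 2^(77-63) = 2^14, still far larger than 1.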
So that’s it! It’s just the wonderful world of scientific computing where nothing works the way you were taught in the math classes…