I don't understand how an int is cast to a float, step by step. Assume I have a signed integer in binary format and I want to convert it to a float by hand, but I can't work out how. Can someone show me how to do that conversion step by step?
I do this conversion in C many times, like this:
int a = foo ( );
float f = ( float ) a ;
But I haven't figured out what happens in the background. To understand it properly, I want to do that conversion by hand.
EDIT: If you know a lot about these conversions, you could also explain float-to-double and float-to-int conversion.
Floating point values (IEEE754 ones, anyway) basically have three components: a sign s; an exponent e; and a mantissa m.
The precision dictates how many bits are available for the exponent and mantissa. Let’s examine the value 0.1 for single-precision floating point:
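If you want to see those three fields for yourself, here is a minimal sketch in C that splits a float into them. It assumes float is a 32-bit IEEE754 single (true on most current platforms, but not guaranteed by the C standard); the names float_fields and split_float are made up for this example.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three raw fields of a 32-bit IEEE754 float. */
typedef struct {
    unsigned sign;      /* 1 bit */
    unsigned exponent;  /* 8 bits, still biased by 127 */
    uint32_t mantissa;  /* 23 bits, implicit leading 1 not included */
} float_fields;

/* Assumption: float is a 32-bit IEEE754 single-precision value. */
float_fields split_float(float f) {
    uint32_t bits;
    float_fields r;

    memcpy(&bits, &f, sizeof bits);   /* copy the raw bits without aliasing tricks */
    r.sign     = bits >> 31;
    r.exponent = (bits >> 23) & 0xFFu;
    r.mantissa = bits & 0x7FFFFFu;
    return r;
}
```

Calling split_float(0.1f) gives sign 0, exponent field 01111011 (123) and mantissa 10011001100110011001101, which are the bits worked through next.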
The sign is positive, that’s pretty easy.
The exponent is 64+32+16+8+2+1 = 123 - 127 bias = -4, so the multiplier is 2^-4 or 1/16. The bias is there so that you can get really small numbers (like 10^-30) as well as large ones.
The mantissa is chunky. It consists of 1 (the implicit base) plus (for all those bits with each being worth 1/(2^n) as n starts at 1 and increases to the right), {1/2, 1/16, 1/32, 1/256, 1/512, 1/4096, 1/8192, 1/65536, 1/131072, 1/1048576, 1/2097152, 1/8388608}.
When you add all these up, you get 1.60000002384185791015625.
When you multiply that by the 2^-4 multiplier, you get 0.100000001490116119384765625, which is why they say you cannot represent 0.1 exactly as an IEEE754 float.
In terms of converting integers to floats, if you have at least as many bits in the mantissa (including the implicit 1) as in the integer, you can just transfer the integer bit pattern over and select the correct exponent. There will be no loss of precision. For example a double precision IEEE754 (64 bits, 52/53 of those being mantissa) has no problem taking on a 32-bit integer.
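Going back to the 0.1 walk-through, the "implicit 1 plus mantissa fractions, times the power of two" arithmetic can be checked with a few lines of C. This is only a sketch: it handles normal numbers (not zero, subnormals, infinities or NaN), and float_bits_to_value is a name invented for the example.

```c
#include <stdint.h>

/* Rebuild the numeric value from a 32-bit IEEE754 single-precision pattern.
   Sketch for normal numbers only: zero, subnormals, inf and NaN are not handled. */
double float_bits_to_value(uint32_t bits) {
    int sign = (int)(bits >> 31);
    int exp = (int)((bits >> 23) & 0xFFu) - 127;   /* remove the bias */
    uint32_t frac = bits & 0x7FFFFFu;

    /* Start at the implicit 1, then add 1/2, 1/4, 1/8, ... for each set bit. */
    double mant = 1.0;
    double weight = 0.5;
    for (int i = 22; i >= 0; i--) {
        if (frac & (1u << i))
            mant += weight;
        weight /= 2.0;
    }

    /* Multiplier 2^exp, built by repeated doubling or halving. */
    double scale = 1.0;
    for (int i = 0; i < exp; i++) scale *= 2.0;
    for (int i = 0; i > exp; i--) scale /= 2.0;

    return (sign ? -1.0 : 1.0) * mant * scale;
}
```

Feeding it 0x3DCCCCCD (the bit pattern of 0.1f) returns 0.100000001490116119384765625, matching the hand calculation.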
If there are more bits in your integer (such as a 32-bit integer and a 32-bit single precision float, which only has 23/24 bits of mantissa) then you need to scale the integer.
This involves stripping off the least significant bits (rounding actually) so that it will fit into the mantissa bits. That involves loss of precision of course but that’s unavoidable.
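The steps above (pick the exponent, shift the bits into place, round off what does not fit) can be sketched by hand in C. This is an illustration under assumptions, not what any particular compiler or FPU literally does: it targets a 32-bit IEEE754 single, uses round-to-nearest with ties to even (the usual default rounding mode), and int_to_float_bits is a hypothetical name.

```c
#include <stdint.h>

/* Build the IEEE754 single-precision bit pattern for a 32-bit signed integer.
   Assumption: round to nearest, ties to even. */
uint32_t int_to_float_bits(int32_t value) {
    if (value == 0)
        return 0;                                  /* +0.0 is all zero bits */

    uint32_t sign = value < 0 ? 1u : 0u;
    /* Magnitude in unsigned arithmetic, which also copes with INT32_MIN. */
    uint32_t mag = sign ? 0u - (uint32_t)value : (uint32_t)value;

    /* The position of the highest set bit is the unbiased exponent. */
    int exp = 31;
    while (!(mag & (1u << exp)))
        exp--;

    uint32_t mant;
    if (exp <= 23) {
        /* Everything fits: move the leading 1 up to bit 23. */
        mant = mag << (23 - exp);
    } else {
        /* Too many bits: drop the low ones and round. */
        int shift = exp - 23;
        uint32_t dropped = mag & ((1u << shift) - 1u);
        uint32_t half = 1u << (shift - 1);
        mant = mag >> shift;
        if (dropped > half || (dropped == half && (mant & 1u)))
            mant++;
        if (mant == (1u << 24)) {                  /* rounding carried out */
            mant >>= 1;
            exp++;
        }
    }

    /* Sign, biased exponent, and the 23 stored bits (implicit 1 masked off). */
    return (sign << 31) | ((uint32_t)(exp + 127) << 23) | (mant & 0x7FFFFFu);
}
```

For example, int_to_float_bits(123456789) gives 0x4CEB79A3, which is the bit pattern of 123456792.0f: the three lowest bits (101) did not fit and were rounded up.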
Let’s have a look at a specific value, 123456789. The following program dumps the bits of each data type.
The output on my system is as follows:
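A dump helper of that kind can be sketched as below. This is a minimal sketch rather than the exact program used here; it assumes a 32-bit int, a 32-bit IEEE754 float and a 64-bit IEEE754 double, and the function names are made up for the example.

```c
#include <stdint.h>
#include <string.h>

/* Write the low nbits bits of v into out as '0'/'1' characters, most
   significant bit first. out must have room for nbits + 1 chars. */
void bit_string(uint64_t v, int nbits, char *out) {
    for (int i = 0; i < nbits; i++)
        out[i] = ((v >> (nbits - 1 - i)) & 1u) ? '1' : '0';
    out[nbits] = '\0';
}

/* Raw bits of a float (assumed 32-bit IEEE754). */
uint32_t float_to_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Raw bits of a double (assumed 64-bit IEEE754). */
uint64_t double_to_bits(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}
```

Under those assumptions, passing 123456789 through bit_string as an int, as (float)123456789 and as (double)123456789 gives the following (spaces added between the sign, exponent and mantissa fields for readability):

```
int:    00000111010110111100110100010101
float:  0 10011001 11010110111100110100011
double: 0 10000011001 1101011011110011010001010100000000000000000000000000
```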
And we’ll look at these one at a time. First the integer, simple powers of two:
Now let’s look at the single precision float. Notice the bit pattern of the mantissa matching the integer as a near-perfect match:
There’s an implicit 1 bit to the left of the mantissa and it’s also been rounded at the other end, which is where that loss of precision comes from (the value changing from 123456789 to 123456792 as in the output from that program above).
Working out the values:
The sign is positive. The exponent is 128+16+8+1 = 153 - 127 bias = 26, so the multiplier is 2^26 or 67108864.
The mantissa is 1 (the implicit base) plus (as explained above), {1/2, 1/4, 1/16, 1/64, 1/128, 1/512, 1/1024, 1/2048, 1/4096, 1/32768, 1/65536, 1/262144, 1/4194304, 1/8388608}. When you add all these up, you get 1.83964955806732177734375.
When you multiply that by the 2^26 multiplier, you get 123456792, the same as the program output.
The double bitmask output is:
I am not going to go through the process of figuring out the value of that beast 🙂 However, I will show the mantissa next to the integer format to show the common bit representation:
You can once again see the commonality with the implicit bit on the left and the vastly greater bit availability on the right, which is why there’s no loss of precision in this case.
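The lossless case can be sketched the same way. With 52 stored mantissa bits there is always room for a 32-bit integer, so there is no rounding step at all, just a shift. As before this is an illustration that assumes a 64-bit IEEE754 double, and int_to_double_bits is a hypothetical name.

```c
#include <stdint.h>

/* Build the IEEE754 double-precision bit pattern for a 32-bit signed integer.
   No rounding is needed: at most 32 significant bits go into 53. */
uint64_t int_to_double_bits(int32_t value) {
    if (value == 0)
        return 0;                                   /* +0.0 is all zero bits */

    uint64_t sign = value < 0 ? 1u : 0u;
    /* Magnitude in unsigned arithmetic, which also copes with INT32_MIN. */
    uint64_t mag = sign ? 0u - (uint32_t)value : (uint32_t)value;

    /* The position of the highest set bit is the unbiased exponent. */
    int exp = 31;
    while (!(mag & (1ull << exp)))
        exp--;

    /* Move the leading 1 up to bit 52, then mask it off (it is implicit). */
    uint64_t frac = (mag << (52 - exp)) & 0xFFFFFFFFFFFFFull;

    return (sign << 63) | ((uint64_t)(exp + 1023) << 52) | frac;
}
```

int_to_double_bits(123456789) gives 0x419D6F3454000000: every bit of the integer after its leading 1 appears unchanged at the top of the mantissa, followed by zeros.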
In terms of converting between floats and doubles, that’s also reasonably easy to understand.
You first have to check the special values such as NaN and the infinities. These are indicated by special exponent/mantissa combinations and it’s probably easier to detect these up front and generate the equivalent in the new format.
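As a sketch of that up-front check for the float-to-double direction: in IEEE754 an exponent field of all ones means infinity (mantissa zero) or NaN (mantissa non-zero). The helper below, with the invented name widen_special, detects that case on a 32-bit single and writes the matching 64-bit double pattern. Preserving the NaN payload by shifting it into the top of the wider mantissa is one reasonable choice, not the only one.

```c
#include <stdint.h>

/* If fbits (a 32-bit IEEE754 single) is an infinity or NaN, store the
   equivalent 64-bit IEEE754 double pattern in *dbits and return 1.
   Otherwise leave *dbits untouched and return 0. */
int widen_special(uint32_t fbits, uint64_t *dbits) {
    uint32_t exp  = (fbits >> 23) & 0xFFu;
    uint32_t frac = fbits & 0x7FFFFFu;

    if (exp != 0xFFu)
        return 0;                        /* ordinary value: handle it elsewhere */

    uint64_t sign = (uint64_t)(fbits >> 31) << 63;
    /* All-ones exponent in the new format; mantissa bits move to the top
       of the 52-bit field (52 - 23 = 29 extra low bits, all zero). */
    *dbits = sign | (0x7FFull << 52) | ((uint64_t)frac << 29);
    return 1;
}
```

With this, +infinity (0x7F800000) maps to 0x7FF0000000000000 and the quiet NaN 0x7FC00000 maps to 0x7FF8000000000000, while an ordinary value such as 0x3DCCCCCD (0.1f) is left for the normal conversion path.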
Then in the case where you’re going from double to float, you obviously have less of a range available to you since there are fewer bits in the exponent (8 rather than 11). If your double is outside the range of a float, you need to handle that.
Assuming it will fit, you then need to: