Can someone explain to me how I convert a 32-bit floating point value to

Editorial Team

Asked: May 15, 2026 (2026-05-15T09:33:39+00:00)

Can someone explain to me how I convert a 32-bit floating point value to a 16-bit floating point value?

(s = sign e = exponent and m = mantissa)

If 32-bit float is 1s7e24m
And 16-bit float is 1s5e10m

Then is it as simple as doing?

int     fltInt32;
short   fltInt16;
memcpy( &fltInt32, &flt, sizeof( float ) );

fltInt16 = (fltInt32 & 0x00FFFFFF) >> 14;
fltInt16 |= ((fltInt32 & 0x7f000000) >> 26) << 10;
fltInt16 |= ((fltInt32 & 0x80000000) >> 16);

I’m assuming it ISN’T that simple … so can anyone tell me what you DO need to do?

Edit: I can see I’ve got my exponent shift wrong … so would THIS be better?

fltInt16 =  (fltInt32 & 0x007FFFFF) >> 13;
fltInt16 |= (fltInt32 & 0x7c000000) >> 13;
fltInt16 |= (fltInt32 & 0x80000000) >> 16;

I’m hoping this is correct. Apologies if I’m missing something obvious that has been said. It’s almost midnight on a Friday night … so I’m not “entirely” sober 😉

Edit 2: Ooops. Buggered it again. I want to lose the top 3 bits not the lower! So how about this:

fltInt16 =  (fltInt32 & 0x007FFFFF) >> 13;
fltInt16 |= (fltInt32 & 0x0f800000) >> 13;
fltInt16 |= (fltInt32 & 0x80000000) >> 16;

Final code should be:

fltInt16    =  ((fltInt32 & 0x7fffffff) >> 13) - (0x38000000 >> 13);
fltInt16    |= ((fltInt32 & 0x80000000) >> 16);

1 Answer

Editorial Team · Answer 1 · 2026-05-15T09:33:40+00:00

The exponents in your float32 and float16 representations are probably biased, and biased differently. You need to unbias the exponent you got from the float32 representation to get the actual exponent, and then to bias it for the float16 representation.

Apart from this detail, I do think it’s as simple as that, but I still get surprised by floating-point representations from time to time.
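The un-bias and re-bias step can be sketched as below. This is only a sketch under assumptions: it takes both formats to be IEEE 754, so float32 is 1s8e23m with bias 127 (not 1s7e24m as written in the question) and float16 is 1s5e10m with bias 15. The name `rebias_exponent` is made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Assumed IEEE 754 biases: binary32 uses 127, binary16 uses 15. */
enum { F32_BIAS = 127, F16_BIAS = 15 };

/* Hypothetical helper: returns the float16 exponent field that matches
   the exponent of flt. The result only fits a normal float16 when it
   lies in 1..30; anything else needs overflow/underflow handling. */
static int rebias_exponent(float flt)
{
    uint32_t bits;
    memcpy(&bits, &flt, sizeof bits);          /* reinterpret without aliasing issues */

    int e32      = (int)((bits >> 23) & 0xFFu); /* 8-bit biased float32 exponent */
    int unbiased = e32 - F32_BIAS;              /* actual power of two */
    return unbiased + F16_BIAS;                 /* biased for float16 */
}
```

For 1.0f the float32 exponent field is 127, so the helper returns 15. The net effect is subtracting 112 from the exponent field, which is what the `0x38000000 >> 13` term in the question's final code does, since 112 << 23 is 0x38000000.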

EDIT:

Check for overflow when re-biasing the exponent while you’re at it.
Your algorithm truncates the low bits of the mantissa rather abruptly. That may be acceptable, but you may want to implement, say, round-to-nearest by looking at the bits that are about to be discarded: “0…” -> round down, “100..001…” (a leading 1 with any later 1 bit) -> round up, “100..00” (an exact tie) -> round to even.
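Putting the two points above together, a conversion with an overflow check and round-to-nearest-even might look roughly like this. It is a sketch, not a definitive implementation: it assumes IEEE 754 layouts (float32 as 1s8e23m with bias 127, float16 as 1s5e10m with bias 15), the name `float_to_half_bits` is hypothetical, and values too small for a normal float16 are simply flushed to zero instead of being turned into subnormals.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: float32 -> float16 bit pattern.
   Assumes IEEE 754 binary32 and binary16 layouts. */
static uint16_t float_to_half_bits(float flt)
{
    uint32_t bits;
    memcpy(&bits, &flt, sizeof bits);

    uint16_t sign = (uint16_t)((bits >> 16) & 0x8000u); /* sign moves to bit 15 */
    uint32_t e32  = (bits >> 23) & 0xFFu;               /* biased float32 exponent */
    uint32_t m32  = bits & 0x007FFFFFu;                 /* 23-bit mantissa */

    /* Inf or NaN: keep the all-ones exponent; force a mantissa bit for NaN. */
    if (e32 == 0xFFu)
        return (uint16_t)(sign | 0x7C00u | (m32 ? 0x0200u : 0u));

    int e16 = (int)e32 - 127 + 15;                      /* un-bias, then re-bias */

    if (e16 >= 31)                                      /* overflow -> infinity */
        return (uint16_t)(sign | 0x7C00u);
    if (e16 <= 0)                                       /* too small: flush to zero */
        return sign;                                    /* (subnormals omitted here) */

    uint32_t half      = ((uint32_t)e16 << 10) | (m32 >> 13);
    uint32_t discarded = m32 & 0x1FFFu;                 /* the 13 bits being dropped */

    /* Round to nearest: above the halfway point rounds up, an exact tie
       (0x1000) rounds up only if that makes the kept mantissa even. */
    if (discarded > 0x1000u || (discarded == 0x1000u && (half & 1u)))
        half++;  /* a carry may ripple into the exponent, up to infinity */

    return (uint16_t)(sign | half);
}
```

With this sketch, 1.0f maps to 0x3C00 and -2.0f to 0xC000. The tie 65520.0f, halfway between the largest finite float16 (65504) and the next step, rounds up into 0x7C00 (infinity) because the increment carries out of the mantissa into the exponent.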
