I need to convert a 32 bit IEEE754 float to a signed Q19.12 fixed-point

Question

Asked: June 7, 2026

I need to convert a 32-bit IEEE 754 float to a signed Q19.12 fixed-point format. The conversion must be fully deterministic, so the usual (int)(f * (1 << FRACTION_SHIFT)) is not suitable, since it relies on floating-point math that may not be deterministic. Are there any “bit fiddling” or similar deterministic conversion methods?

Edit: Deterministic here means that the same floating-point input produces exactly the same conversion result on different platforms.

1 Answer

Editorial Team · Answer 1 · 2026-06-07T23:59:52+00:00

While @StephenCanon’s answer might be right about this particular case being fully deterministic, I’ve decided to stay on the safer side, and still do the conversion manually. This is the code I have ended up with (thanks to @CodesInChaos for pointers on how to do this):

public static Fixed FromFloatSafe(float f) {
    // Extract float bits
    uint fb = BitConverter.ToUInt32(BitConverter.GetBytes(f), 0);
    uint sign = (uint)((int)fb >> 31);
    uint exponent = (fb >> 23) & 0xFF;
    uint mantissa = (fb & 0x007FFFFF);

    // Reject Infinity, SNaN, QNaN (all-ones exponent)
    if (exponent == 255) {
        throw new ArgumentException();
    }
    // Add the mantissa's implicit leading 1 for normal numbers
    // (subnormals, exponent == 0, have none)
    if (exponent != 0) {
        mantissa |= 0x800000;
    }

    // Mantissa with adjusted sign
    int raw = (int)((mantissa ^ sign) - sign);
    // Radix point shift from the float's 23-bit mantissa to fixed point:
    // value = mantissa * 2^(exponent - 127 - 23), raw = value * 2^FRACTION_SHIFT
    int shift = (int)exponent - 127 - 23 + FRACTION_SHIFT;

    // Do the shifting and check for overflows
    if (shift > 30) {
        throw new OverflowException();
    } else if (shift > 0) {
        long ul = (long)raw << shift;
        if (ul > int.MaxValue) {
            throw new OverflowException();
        }
        if (ul < int.MinValue) {
            throw new OverflowException();
        }
        raw = (int)ul;
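    } else {
        // C# masks an int shift count to its low 5 bits, so clamp it;
        // shifting by 31 already yields 0 (or -1 for negative values)
        raw = raw >> Math.Min(-shift, 31);
    }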

    return Fixed.FromRaw(raw);
}
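
To sanity-check the approach, here is a minimal stand-alone sketch of the same bit-level conversion. It is not the code above: the class name FixedSketch, the ToRaw method, and the plain int return value are hypothetical stand-ins for the Fixed type, FRACTION_SHIFT is assumed to be 12, and this variant truncates toward zero like the (int) cast rather than rounding toward negative infinity as the arithmetic shift above does.

```csharp
using System;

// Hypothetical helper for illustration; not part of the original Fixed type.
public static class FixedSketch {
    const int FRACTION_SHIFT = 12;  // assumed: Q19.12 has 12 fractional bits

    public static int ToRaw(float f) {
        // Reinterpret the float's bits and split them into fields
        uint fb = BitConverter.ToUInt32(BitConverter.GetBytes(f), 0);
        bool negative = (fb >> 31) != 0;
        int exponent = (int)((fb >> 23) & 0xFF);
        long mantissa = fb & 0x007FFFFF;

        if (exponent == 255) {
            throw new ArgumentException("Infinity and NaN cannot be converted.");
        }
        if (exponent != 0) {
            mantissa |= 0x800000;   // implicit leading 1 of normal numbers
        } else {
            exponent = 1;           // subnormals use the minimum exponent
        }

        // value = mantissa * 2^(exponent - 127 - 23); raw = value * 2^FRACTION_SHIFT
        int shift = exponent - 127 - 23 + FRACTION_SHIFT;
        long magnitude;
        if (shift > 30) {
            throw new OverflowException();
        } else if (shift >= 0) {
            magnitude = mantissa << shift;   // at most 2^54, fits in a long
        } else {
            // Clamp the count, since C# masks shift counts instead of saturating
            magnitude = mantissa >> Math.Min(-shift, 63);
        }

        // Apply the sign after shifting, so the result truncates toward zero
        long raw = negative ? -magnitude : magnitude;
        if (raw > int.MaxValue || raw < int.MinValue) {
            throw new OverflowException();
        }
        return (int)raw;
    }
}
```

With this sketch, ToRaw(1.0f) gives 4096, ToRaw(-1.5f) gives -6144, ToRaw(0.1f) gives 409, and a tiny input such as 2^-21 gives 0. An input of 524288.0f (2^19) throws OverflowException, while -524288.0f maps to int.MinValue.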
