So apparently on my machine, float, double and long double each have different sizes

Asked: May 24, 2026

So apparently on my machine, float, double and long double each have a different size. There also doesn’t seem to be a strict standard enforcing how many bytes each of those types has to be.

How would one, then, save a floating point value into a binary file and have it read properly on a different system if the sizes differ? E.g. my machine has 8-byte doubles, whereas Joe’s has 12-byte doubles.

Without having to export it in text form (e.g. “0.3232”), that is; text seems a lot less compact than the binary representation.

1 Answer

Editorial Team · Answer 1 · 2026-05-24T21:13:15+00:00

You have to define a format, and implement that. Typically, most of the
network protocols I know use IEEE float and double, output big-endian
(but other formats are possible). The advantage of using IEEE formats
is that it is what most of the current everyday machines use
internally; if you’re on one of these machines (and portability of your
code to other machines, like mainframes, isn’t an issue), you can
“convert” to the format simply by type-punning to an unsigned int of the
same size, and outputting that. So, for example, you might have:

obstream&
operator<<( obstream& dest, uint64_t value )
{
    dest.put((value >> 56) & 0xFF);
    dest.put((value >> 48) & 0xFF);
    dest.put((value >> 40) & 0xFF);
    dest.put((value >> 32) & 0xFF);
    dest.put((value >> 24) & 0xFF);
    dest.put((value >> 16) & 0xFF);
    dest.put((value >>  8) & 0xFF);
    dest.put((value      ) & 0xFF);
    return dest;
}

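obstream&
operator<<( obstream& dest, double value )
{
    //  Copy the bits instead of using reinterpret_cast, which would
    //  violate the aliasing rules.  Needs <cstring>.
    uint64_t bits;
    std::memcpy( &bits, &value, sizeof bits );
    return dest << bits;
}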
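
The shortcut above is only valid when double really is a 64-bit IEEE type, and that can be checked at compile time. What follows is a minimal sketch of such a check, written against a plain std::ostream rather than the obstream class used in this answer. The names writeDoubleBE and encodeDoubleBE are hypothetical, and the code assumes that double and uint64_t share the same byte order on the host, which holds on mainstream platforms.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>
#include <ostream>
#include <sstream>
#include <string>

// Refuse to compile where copying the bits would not yield IEEE 754 binary64.
static_assert( std::numeric_limits<double>::is_iec559, "double is not IEEE 754" );
static_assert( sizeof( double ) == sizeof( std::uint64_t ), "double is not 64 bits" );

// Hypothetical helper: writes value as 8 bytes, most significant byte first.
std::ostream& writeDoubleBE( std::ostream& dest, double value )
{
    std::uint64_t bits;
    std::memcpy( &bits, &value, sizeof bits );   // well-defined way to read the bits
    for ( int shift = 56; shift >= 0; shift -= 8 ) {
        dest.put( static_cast<char>( (bits >> shift) & 0xFF ) );
    }
    return dest;
}

// Convenience wrapper for experimenting: returns the 8 encoded bytes.
std::string encodeDoubleBE( double value )
{
    std::ostringstream out( std::ios::binary );
    writeDoubleBE( out, value );
    std::string bytes = out.str();
    assert( bytes.size() == 8 );   // one byte per put() call
    return bytes;
}
```

On a machine where either static_assert fails, the arithmetic approach described next is the fallback.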

If you have to be portable to a machine not supporting IEEE (e.g. any of
the modern mainframes), you’ll need something a bit more complicated:

obstream&
operator<<( obstream& dest, double value )
{
    bool                isNeg = value < 0;
    if ( isNeg ) {
        value = - value;
    }
    int                 exp;
    if ( value == 0.0 ) {
        exp = 0;
    } else {
        //  frexp yields a fraction in [0.5, 1); scale it to a 53-bit
        //  integer and add the bias an IEEE double expects.
        value = ldexp( frexp( value, &exp ), 53 );
        exp += 1022;
    }
    uint64_t mant = static_cast< uint64_t >( value );
    //  1 sign bit, 11 exponent bits, then the low 52 mantissa bits
    //  (the leading 1 bit is implicit and is masked off).
    dest.put( (isNeg ? 0x80 : 0x00) | exp >> 4 );
    dest.put( ((exp << 4) & 0xF0) | ((mant >> 48) & 0x0F) );
    dest.put( mant >> 40 );
    dest.put( mant >> 32 );
    dest.put( mant >> 24 );
    dest.put( mant >> 16 );
    dest.put( mant >>  8 );
    dest.put( mant       );
    return dest;
}

(Note that this doesn’t handle NaNs, infinities or denormalized values
correctly. Personally, I would ban NaNs and infinities from the format,
since not all floating point representations support them. But then,
there’s no floating point format on an IBM mainframe which will support
1E306, either, although you can encode it in the IEEE double format
above.)
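
One way to enforce such a ban is to reject the offending values at encoding time. The following is only a sketch under stated assumptions: toBinary64Bits is a hypothetical helper that returns the 64-bit pattern instead of writing to a stream, it throws on NaN, infinity and anything outside the normal range of an IEEE double, and unlike the code above it keeps the sign of negative zero.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <stdexcept>

// Hypothetical helper: builds the IEEE 754 binary64 bit pattern arithmetically,
// so it does not depend on how the host stores a double.
std::uint64_t toBinary64Bits( double value )
{
    if ( !std::isfinite( value ) ) {
        throw std::invalid_argument( "NaN and infinity are banned from the format" );
    }
    std::uint64_t sign = std::signbit( value ) ? 0x8000000000000000ULL : 0;
    value = std::fabs( value );
    if ( value == 0.0 ) {
        return sign;                               // all exponent and mantissa bits zero
    }
    int exp;
    double frac = std::frexp( value, &exp );       // value == frac * 2^exp, frac in [0.5, 1)
    int biased = exp + 1022;                       // IEEE bias, adjusted for frexp's range
    if ( biased < 1 || biased > 2046 ) {
        throw std::range_error( "not representable as a normal binary64 value" );
    }
    std::uint64_t mant = static_cast<std::uint64_t>( std::ldexp( frac, 53 ) );
    assert( (mant >> 52) == 1 );                   // the implicit leading bit
    return sign
        | (static_cast<std::uint64_t>( biased ) << 52)
        | (mant & 0x000FFFFFFFFFFFFFULL);
}
```

Throwing is just one policy; a format could equally define a substitute value, as long as reader and writer agree on it.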

Reading is, of course, the opposite. Either:

ibstream&
operator>>( ibstream& source, uint64_t& results )
{
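    //  Widen each byte before shifting: get() returns an int, and
    //  shifting an int by 32 or more bits is undefined.
    uint64_t value = static_cast<uint64_t>( source.get() & 0xFF ) << 56;
    value |= static_cast<uint64_t>( source.get() & 0xFF ) << 48;
    value |= static_cast<uint64_t>( source.get() & 0xFF ) << 40;
    value |= static_cast<uint64_t>( source.get() & 0xFF ) << 32;
    value |= static_cast<uint64_t>( source.get() & 0xFF ) << 24;
    value |= static_cast<uint64_t>( source.get() & 0xFF ) << 16;
    value |= static_cast<uint64_t>( source.get() & 0xFF ) <<  8;
    value |= static_cast<uint64_t>( source.get() & 0xFF )      ;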
    if ( source )
        results = value;
    return source;
}

ibstream&
operator>>( ibstream& source, double& results)
{
    uint64_t tmp;
    source >> tmp;
    if ( source )
        //  Copy the bits back; needs <cstring>.
        std::memcpy( &results, &tmp, sizeof results );
    return source;
}
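
For experimenting without the ibstream class, the same idea can be sketched against a plain std::istream. The names readDoubleBE and decodeDoubleBE are hypothetical. The sketch assumes a 64-bit IEEE double whose byte order matches that of uint64_t on the host, and it leaves the result untouched when fewer than 8 bytes are available.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <istream>
#include <limits>
#include <sstream>
#include <string>

static_assert( std::numeric_limits<double>::is_iec559 && sizeof( double ) == 8,
               "bit-copy decoding requires a 64-bit IEEE 754 double" );

// Hypothetical helper: reads 8 big-endian bytes and copies them into a double.
// On a short read, read() sets failbit and result is left unchanged.
std::istream& readDoubleBE( std::istream& source, double& result )
{
    char raw[8];
    if ( source.read( raw, sizeof raw ) ) {
        assert( source.gcount() == 8 );            // read() succeeded, so all bytes arrived
        std::uint64_t bits = 0;
        for ( char byte : raw ) {
            bits = (bits << 8) | static_cast<unsigned char>( byte );
        }
        std::memcpy( &result, &bits, sizeof result );
    }
    return source;
}

// Convenience wrapper for experimenting: decodes from an in-memory byte string.
bool decodeDoubleBE( std::string const& bytes, double& result )
{
    std::istringstream in( bytes, std::ios::binary );
    return !readDoubleBE( in, result ).fail();
}
```

Reading all 8 bytes with one read() call means a truncated file is reported through the stream state rather than silently producing a value padded with garbage.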

or if you can’t count on IEEE:

ibstream&
operator>>( ibstream& source, double& results )
{
    uint64_t tmp;
    source >> tmp;
    if ( source ) {
        double f = 0.0;
        if ( (tmp & 0x7FFFFFFFFFFFFFFF) != 0 ) {
            //  Restore the implicit leading bit, then undo the bias
            //  and the 53-bit scaling applied when writing.
            f = ldexp( static_cast<double>(
                           (tmp & 0x000FFFFFFFFFFFFF) | 0x0010000000000000 ),
                       static_cast<int>( (tmp & 0x7FF0000000000000) >> 52 )
                                - 1022 - 53 );
        }
        if ( (tmp & 0x8000000000000000) != 0 ) {
            f = -f;
        }
        results = f;
    }
    return source;
}

(This assumes that the input is not a NaN, an infinity or a
denormalized value.)
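
If the reader should check that assumption rather than trust it, the exponent field can be inspected before the value is rebuilt. A minimal sketch, assuming the 8 bytes have already been assembled into a uint64_t; fromBinary64Bits is a hypothetical helper, and throwing on NaN or infinity is just one possible policy. Denormalized input is decoded here, which only gives the exact value if the host's double can represent it.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <stdexcept>

// Hypothetical helper: rebuilds a double from an IEEE 754 binary64 bit pattern
// using arithmetic only, so it does not rely on the host's own representation.
double fromBinary64Bits( std::uint64_t bits )
{
    int biased = static_cast<int>( (bits >> 52) & 0x7FF );
    std::uint64_t fraction = bits & 0x000FFFFFFFFFFFFFULL;
    assert( fraction < (1ULL << 52) );             // only the 52 stored mantissa bits
    if ( biased == 0x7FF ) {
        throw std::invalid_argument( "NaN and infinity are not part of the format" );
    }
    double result;
    if ( biased == 0 ) {
        // Zero or denormal: no implicit leading bit, exponent fixed at -1022.
        result = std::ldexp( static_cast<double>( fraction ), -1022 - 52 );
    } else {
        // Normal value: restore the implicit leading bit and remove the bias.
        result = std::ldexp( static_cast<double>( fraction | 0x0010000000000000ULL ),
                             biased - 1023 - 52 );
    }
    return (bits >> 63) != 0 ? -result : result;
}
```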
