Modern CPUs can perform extended multiplication between two native-size words and store the low

Question

Asked: June 14, 2026 (2026-06-14T01:21:46+00:00)

Modern CPUs can perform extended multiplication between two native-size words and store the low and high halves of the result in separate registers. Similarly, when performing division, they store the quotient and the remainder in two different registers instead of discarding the unwanted part.

Is there some sort of portable gcc intrinsic with the following signature:

void extmul(size_t a, size_t b, size_t *lo, size_t *hi);

Or something like that, and for division:

void extdiv(size_t a, size_t b, size_t *q, size_t *r);

I know I could do it myself with inline assembly and shoehorn portability into it by scattering #ifdefs through the code, or I could emulate the multiplication part using partial sums (which would be significantly slower), but I would like to avoid both for readability. Surely there exists some built-in function to do this?

1 Answer

Editorial Team · Answer 1 · 2026-06-14T01:21:48+00:00

Since version 4.6, gcc provides __int128. This works on most 64-bit hardware.

For instance, to get the 128-bit result of a 64×64-bit multiplication, just use

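#include <stddef.h>

void extmul(size_t a, size_t b, size_t *lo, size_t *hi) {
    /* Use the unsigned 128-bit type: the product of two large 64-bit
       values does not fit in a signed __int128. */
    unsigned __int128 result = (unsigned __int128)a * b;
    *lo = (size_t)result;          /* low 64 bits */
    *hi = (size_t)(result >> 64);  /* high 64 bits */
}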

On x86_64 gcc is smart enough to compile this to

   0:   48 89 f8                mov    %rdi,%rax
   3:   49 89 d0                mov    %rdx,%r8
   6:   48 f7 e6                mul    %rsi
   9:   49 89 00                mov    %rax,(%r8)
   c:   48 89 11                mov    %rdx,(%rcx)
   f:   c3                      retq

No native 128-bit support or similar is required: the single mul instruction already leaves the low half in rax and the high half in rdx. After inlining, only the mul instruction remains.
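As a quick sanity check of the widening-multiply pattern, here is a minimal sketch using fixed-width types. The name extmul64 is hypothetical (it is not a gcc built-in), and the code assumes a GCC-compatible compiler on a target that provides __int128.

```c
#include <stdint.h>

/* Hypothetical fixed-width variant of extmul: 64x64 -> 128-bit product.
   Assumes a GCC-compatible compiler on a target with __int128. */
void extmul64(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi) {
    /* Widen one operand first so the multiplication is done in 128 bits. */
    unsigned __int128 p = (unsigned __int128)a * b;
    *lo = (uint64_t)p;          /* low 64 bits of the product */
    *hi = (uint64_t)(p >> 64);  /* high 64 bits of the product */
}
```

For example, multiplying UINT64_MAX by itself should give lo = 1 and hi = UINT64_MAX - 1, since (2^64 - 1)^2 = (2^64 - 2) * 2^64 + 1.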

Edit: On a 32-bit architecture this works in a similar way: replace the 128-bit type with uint64_t and the shift width with 32. The optimization works even on older gcc versions.
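The 32-bit variant described above might look like the following sketch. The name extmul32 is hypothetical. For the division half of the question, the answer gives no intrinsic; a plain / and % pair on the same operands is shown here as an assumption-level sketch, since optimizing compilers commonly merge the two into one divide instruction on targets where that instruction yields both results.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit counterpart: 32x32 -> 64-bit product. */
void extmul32(uint32_t a, uint32_t b, uint32_t *lo, uint32_t *hi) {
    uint64_t p = (uint64_t)a * b;  /* widen before multiplying */
    *lo = (uint32_t)p;             /* low 32 bits */
    *hi = (uint32_t)(p >> 32);     /* high 32 bits */
}

/* Sketch of extdiv with the signature from the question.
   The caller must ensure b != 0. Whether a single divide instruction
   is emitted depends on the compiler and optimization level. */
void extdiv(size_t a, size_t b, size_t *q, size_t *r) {
    *q = a / b;  /* quotient */
    *r = a % b;  /* remainder */
}
```

Inspecting the generated assembly (for example with objdump -d, as in the listing above) is the way to confirm what a particular compiler actually emits.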
