I need to do the following operation many times: Take two integers a, b

Editorial Team

Asked: May 28, 2026 (2026-05-28T16:52:48+00:00)

I need to do the following operation many times:

Take two integers a, b
Compute a * b mod p, where p = 1000000007 and a, b are of the same order of magnitude as p

My gut feeling is the naive

result = a * b
result %= p

is inefficient. Can I optimise multiplication modulo p much like exponentiation modulo p is optimised with pow(a, b, p)?

1 Answer

Editorial Team · Answer 1 · 2026-05-28T16:52:49+00:00

To do this calculation in assembly but keep it callable from Python, I’d
try inline assembly from a Python module written in C. Both the GCC and MSVC
compilers support inline assembly, though with differing syntax (and MSVC
only when targeting 32-bit x86).

Note that our modulus p = 1000000007 just fits into 30 bits. The desired
result (a*b)%p can be computed in Intel 80x86 registers, given the weak
restriction that a and b are not much bigger than p.

Restrictions on size of a,b

(1) a,b are 32-bit unsigned integers

(2) a*b is less than p << 32, i.e. p times 2^32

In particular, if a and b are each less than 2*p, overflow is avoided,
because then a*b < 4*p*p and 4*p < 2^32 (p is below 2^30).
Given (1), it also suffices that either one of them is less than p.
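
As a minimal sketch, restriction (2) can be tested in plain C before any operands reach DIV. The helper names fits_div32 and require_fits_div32 are hypothetical, chosen only for illustration; the modulus is the one from the question.

```c
#include <assert.h>
#include <stdint.h>

#define P 1000000007u   /* modulus from the question; P < 2^30 */

/* Returns 1 when a*b < P * 2^32, i.e. when the quotient (a*b)/P
   fits in 32 bits (restriction (2)). Hypothetical helper name. */
int fits_div32(uint32_t a, uint32_t b)
{
    uint64_t product = (uint64_t)a * b;   /* two 32-bit values cannot overflow 64 bits */
    return product < ((uint64_t)P << 32);
}

/* Debug-build guard: aborts if the operands would overflow DIV. */
void require_fits_div32(uint32_t a, uint32_t b)
{
    assert(fits_div32(a, b));
}
```

Operands just below 2*p pass the check, as does any 32-bit value paired with an operand below p, while two operands near 2^32 do not.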

The Intel 80x86 instruction MUL can multiply two 32-bit unsigned integers
and store the 64-bit result in accumulator register pair EDX:EAX. Some
details and quirks of MUL are discussed in Section 10.2.1 of this helpful
summary.

The instruction DIV can then divide this 64-bit result by a 32-bit constant
(the modulus p), storing the quotient in EAX and the remainder in EDX.
See Section 10.2.2 of the same summary. The result we want is that remainder.

It is this division instruction DIV that entails a risk of overflow: if the
64-bit numerator in EDX:EAX fails to satisfy (2) above, the quotient does not
fit in 32 bits and DIV raises a divide error instead of returning a result.
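
The MUL/DIV sequence described above might look roughly like the following with GCC-style extended inline assembly. This is only a sketch under stated assumptions, not the snippet the answer refers to: the function names mulmod_ref and mulmod_p are hypothetical, the assembly path is compiled only for GCC-compatible compilers on x86, and every other target falls back to a portable 64-bit computation.

```c
#include <assert.h>
#include <stdint.h>

#define P 1000000007u   /* modulus from the question */

/* Portable reference: 64-bit product, then remainder. */
uint32_t mulmod_ref(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) % P);
}

/* MUL followed by DIV. The caller must guarantee a*b < P * 2^32
   (restriction (2)); otherwise DIV raises a divide error. */
uint32_t mulmod_p(uint32_t a, uint32_t b)
{
    assert((uint64_t)a * b < ((uint64_t)P << 32));
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t lo = a, hi, p = P;
    __asm__ ("mull %2\n\t"   /* EDX:EAX = EAX * b               */
             "divl %3"       /* EAX = quotient, EDX = remainder */
             : "+a"(lo), "=&d"(hi)   /* EAX in/out, EDX early-clobbered */
             : "r"(b), "r"(p)
             : "cc");
    return hi;               /* the remainder left in EDX */
#else
    return mulmod_ref(a, b); /* fallback on other compilers and CPUs */
#endif
}
```

The early-clobber marker on EDX keeps the compiler from placing b or p in a register that MUL overwrites before DIV reads it. Whether this beats what the compiler generates for mulmod_ref would have to be measured.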

I’m working on a code snippet in C/inline assembly for “proof of concept”.
However the maximum benefit in speed will depend on batching up arrays of
data a,b to process, amortizing the overhead of function calls, etc. in
Python (if that is the target platform).
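
To illustrate the batching idea, a C routine could process whole arrays in one call, so that a Python wrapper pays the call overhead once per batch rather than once per product. A minimal sketch, assuming a hypothetical function name mulmod_batch and using the portable 64-bit computation rather than inline assembly:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define P 1000000007u   /* modulus from the question */

/* out[i] = (a[i] * b[i]) mod P for every i in [0, n).
   Hypothetical batch entry point; one call handles n products. */
void mulmod_batch(const uint32_t *a, const uint32_t *b,
                  uint32_t *out, size_t n)
{
    assert(n == 0 || (a && b && out));
    for (size_t i = 0; i < n; ++i)
        out[i] = (uint32_t)(((uint64_t)a[i] * b[i]) % P);
}
```

How the arrays get from Python into such a routine (a C extension module, the buffer protocol, or something else) is left open here.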
