I need to do the following operation many times:
- Take two integers a, b
- Compute a * b mod p, where p = 1000000007 and a, b are of the same order of magnitude as p
My gut feeling is that the naive
result = a * b
result %= p
is inefficient. Can I optimise multiplication modulo p much like exponentiation modulo p is optimised with pow(a, b, p)?
To do this calculation in assembly, but have it callable from Python, I’d try inline assembly from a Python module written in C. Both the GCC and MSVC compilers feature inline assembly, only with differing syntax.
Note that our modulus p = 1000000007 just fits into 30 bits. The desired result (a*b)%p can be computed in Intel 80x86 registers given some weak restrictions on a, b not being much bigger than p.

Restrictions on size of a, b:
(1) a, b are 32-bit unsigned integers
(2) a*b is less than p << 32, i.e. p times 2^32

In particular, if a, b are each less than 2*p, overflow will be avoided. Given (1), it also suffices that either one of them is less than p.

The Intel 80x86 instruction MUL can multiply two 32-bit unsigned integers
and store the 64-bit result in accumulator register pair EDX:EAX. Some
details and quirks of MUL are discussed in Section 10.2.1 of this helpful
summary.
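As a rough illustration of what MUL leaves in the register pair, here is a minimal pure-Python model. The helper name `mul32` and the constant `MASK32` are made up for this sketch; it only mimics the arithmetic, not the instruction's flags or timing.

```python
MASK32 = 0xFFFFFFFF  # largest 32-bit unsigned value (name is an assumption)

def mul32(a, b):
    """Model an unsigned 32x32 -> 64-bit multiply; return the (EDX, EAX) halves."""
    if not (0 <= a <= MASK32 and 0 <= b <= MASK32):
        raise ValueError("operands must be 32-bit unsigned integers")
    product = a * b                          # always below 2**64 for 32-bit inputs
    return product >> 32, product & MASK32   # high half (EDX), low half (EAX)

# Example: (p - 1) squared, split across the two halves and reassembled.
edx, eax = mul32(1000000006, 1000000006)
assert (edx << 32) | eax == 1000000006 ** 2
```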
The instruction DIV can then divide this 64-bit result by a 32-bit constant (the modulus p), storing the quotient in EAX and the remainder in EDX. See Section 10.2.2 of the last link. The result we want is that remainder.
It is this division instruction DIV that entails a risk of overflow: should the 64-bit product in the numerator EDX:EAX fail to satisfy (2) above, the quotient would not fit in 32 bits, and DIV raises a divide-error exception rather than truncating the result.
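To make the overflow condition concrete, the following sketch models the MUL/DIV pair in pure Python. The names `mulmod_model`, `P` and `MASK32` are hypothetical. The check `edx >= p` is equivalent to the quotient exceeding 32 bits, which is the same as violating restriction (2).

```python
P = 1000000007       # the modulus from the question
MASK32 = 0xFFFFFFFF  # largest 32-bit unsigned value (name is an assumption)

def mulmod_model(a, b, p=P):
    """Model MUL followed by DIV; return (a * b) % p or raise on quotient overflow."""
    if not (0 <= a <= MASK32 and 0 <= b <= MASK32 and 0 < p <= MASK32):
        raise ValueError("a, b and p must be 32-bit unsigned integers, with p nonzero")
    product = a * b
    edx, eax = product >> 32, product & MASK32    # MUL: product lands in EDX:EAX
    if edx >= p:                                  # same condition as a * b >= p << 32
        raise OverflowError("quotient would not fit in 32 bits")
    quotient, remainder = divmod((edx << 32) | eax, p)  # DIV: quotient -> EAX, remainder -> EDX
    assert quotient <= MASK32                     # guaranteed by the check above
    return remainder

# Operands below 2*p are always safe because 4*p < 2**32.
assert 4 * P < 2**32
assert mulmod_model(2 * P - 1, 2 * P - 1) == ((2 * P - 1) ** 2) % P
```

The guard mirrors restriction (2): with one operand below p and the other any 32-bit value, `edx` stays below `p` and the call succeeds.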
I’m working on a code snippet in C/inline assembly for “proof of concept”.
However, the maximum benefit in speed will depend on batching up arrays of data a, b to process, amortizing the overhead of function calls, etc. in Python (if that is the target platform).
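The batching idea can be sketched as an interface that takes whole sequences in a single call. This pure-Python stand-in (the name `mulmod_batch` is hypothetical) only shows the shape such an entry point might have; a compiled C version would be needed for any actual speed-up, and none is measured here.

```python
P = 1000000007  # the modulus from the question

def mulmod_batch(xs, ys, p=P):
    """Return (x * y) % p for each pair; one call covers the whole batch."""
    if len(xs) != len(ys):
        raise ValueError("input sequences must have the same length")
    return [(x * y) % p for x, y in zip(xs, ys)]

# One call for three products instead of three separate calls.
results = mulmod_batch([2, P - 1, 123456789], [3, P - 1, 987654321])
assert results[:2] == [6, 1]   # (p - 1)**2 is congruent to 1 mod p
```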