I am working on a GPU device that has very high integer-division latency, several hundred cycles. I am looking to optimize divisions.
All divisions are by a denominator in the set { 1, 3, 6, 10 }; the numerator, however, is a runtime positive value, roughly 32000 or less. Due to memory constraints, a lookup table may not be a good option.
Can you think of alternatives?
I have thought of computing floating-point inverses and using those to multiply the numerator.
Thanks
PS. Thank you, people. The bit-shift hack is really cool.
To recover from the roundoff, I use the following C segment:
// q = m/n (q may be one too low after the multiply-and-shift)
q += (n*(q+1)-1) < m;
Can you build a lookup table for the denominators? Since you said 15-bit numerators, you could use 17 for the shifts if everything is unsigned 32-bit.
The larger the shift, the smaller the rounding error. You can do a brute-force check to see how many times, if any, this is actually wrong.