I’m having trouble understanding the algorithm being used in this FPGA circuit. It deals

Question


Asked: May 21, 2026 (2026-05-21T04:20:28+00:00)

I’m having trouble understanding the algorithm being used in this FPGA circuit. It deals with redundant versus non-redundant number formats. I have seen some mathematical (formal) definitions of the non-redundant format, but I just can’t really grasp them.

Excerpt from this paper describing the algorithm:

Figure 3 shows a block diagram of the scalable Montgomery multiplier. The kernel contains p w-bit PEs for a total of wp bit cells. Z is stored in carry-save redundant form. If PE p completes Z^0 before PE1 has finished Z^(e-1), the result must be queued until PE1 becomes available again. The design in [5] queues the results in redundant form, requiring 2w bits per entry. For large n the queue consumes significant area, so we propose converting Z to nonredundant form to save half the queue space, as shown in Figure 4. On the first cycle, Z is initialized to 0. When no queuing is needed, the carry-save redundant Z’ is bypassed directly to avoid the latency of the carry-propagate adder. The nonredundant Z result is also an output of the system.

And the diagrams:
Figure 3 is the high-level view; Figure 4 is the FIFO, which is 'improved' by making it use non-redundant format.

And here is the “improved” PE block diagram. In this case ‘improved’ has to do with some unrelated aspects.
'Improved' PE Block Diagram

I don’t have a picture of the ‘not improved’ FIFO, but I think it is just a plain FIFO. What I don’t understand is: do the FIFO’s CPA and 3-input MUX somehow convert between formats?

Understanding redundant versus non-redundant formats (with concrete examples) is the first step; understanding how this circuit converts between them would be step 2.


1 Answer

Editorial Team · Answer 1 · 2026-05-21T04:20:28+00:00

A bit of background and a look at users.ece.utexas.edu/~adnan/vlsi-05-backup/lec12Datapath.ppt suggests the following:

Doing a proper binary add is relatively slow and/or area-consuming, because of the time that it takes to propagate the carries properly.

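If you work bit-wise in parallel, you can take three binary numbers, sum the bits at the same position in each number, and produce two binary numbers: one holding the sum bit for each position, and one holding the carry bit for each position, shifted left by one place. No carry has to ripple from one position to the next, so this step is fast.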

Slide 27 points out that 0001 + 0111 + 1101 = 1011 + 0101(0).
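To make that concrete, here is a small Python sketch of the 3-to-2 reduction applied to the same three numbers. The function name `carry_save_add` is just something I made up for illustration; it models the bit-wise logic, not any particular hardware cell.

```python
def carry_save_add(a, b, c):
    """3:2 reduction: turn three non-negative ints into a (sum, carry) pair."""
    sum_bits = a ^ b ^ c                      # per-position sum, ignoring carries
    carry_bits = (a & b) | (a & c) | (b & c)  # per-position carry out (majority)
    return sum_bits, carry_bits << 1          # a carry counts one position higher


s, c = carry_save_add(0b0001, 0b0111, 0b1101)
print(format(s, "04b"), format(c, "05b"))  # 1011 01010

# The pair is "redundant": the real value only appears once you add the two.
assert s + c == 0b0001 + 0b0111 + 0b1101  # 11 + 10 == 21
```

The trailing `(0)` in the slide's `0101(0)` is the left shift: the carry word is `0101` moved up one position.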

Since a multiplier needs to do a LOT of additions, you build the adder tree as a collection of reductions of 3 numbers to 2 numbers, eventually ending up with two numbers as output: abcde…z and ABCDE…Z0. This pair is your output in redundant form, and the true answer is in fact abcde…z + ABCDE…Z0.
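As a rough sketch of the whole idea (my own illustration, not the paper's architecture), the code below keeps reducing a list of operands three at a time until only two words are left, then does one ordinary addition to get the non-redundant result. That last addition is the job a carry-propagate adder (the CPA in the question) does in hardware. The names `reduce_to_redundant` and `to_nonredundant` are hypothetical, and the order in which operands are combined here is arbitrary.

```python
def carry_save_add(a, b, c):
    """3:2 reduction: three non-negative ints in, (sum, carry) pair out."""
    sum_bits = a ^ b ^ c
    carry_bits = (a & b) | (a & c) | (b & c)
    return sum_bits, carry_bits << 1


def reduce_to_redundant(operands):
    """Apply 3:2 reductions until only two words remain (redundant form)."""
    values = list(operands)
    while len(values) > 2:
        s, c = carry_save_add(values[0], values[1], values[2])
        values = values[3:] + [s, c]  # three words replaced by two
    while len(values) < 2:
        values.append(0)              # pad so a pair is always returned
    return values[0], values[1]


def to_nonredundant(sum_word, carry_word):
    """The one slow step: a full add with carry propagation."""
    return sum_word + carry_word


s, c = reduce_to_redundant([3, 5, 7, 9, 11])
print(to_nonredundant(s, c))  # 35

# Many different pairs stand for the same value, hence "redundant".
assert to_nonredundant(0b1011, 0b01010) == to_nonredundant(21, 0) == 21
```

This also matches the storage argument in the excerpt: the redundant form needs two words per value (2w bits per queue entry), while the non-redundant form needs only one.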
