I am implementing Paxos in a cluster simulator application, using the documentation available in Wikipedia. Unfortunately, it leaves several doors open to interpretation and does not provide much information on key implementation issues. It is unclear and incomplete.
- Assuming a cluster divided in 3 regions, each containing 3 nodes (total = 9 nodes). What happens if communication is broken between regions? There is no way any leader can reach quorum (which is 5).
Isn’t Paxos going to enter an infinite loop? I guess one should not initiate Paxos if one cannot communicate with at least a quorum of nodes.
- In Phase 1b: ‘If the proposal number N is larger than any previous proposal, then each Acceptor promises not to accept proposals less than N, and sends the value it last accepted for this instance to the Proposer‘.
What is ‘the last value it accepted’? Is it any previous proposal number from the proposer?
What does ‘instance’ refer to exactly in this case?
-
In Phase 1a: Does one include the value to agree on with the Prepare message or is this deferred to the Accept! message? Or it does matter?
-
In Phase 2a: ‘If any of the Acceptors have already accepted a value, the leader must Choose a value with the maximum proposal number N‘.
What is value here? Is it the proposal number? I believe not, but this phrase is unclear.
-
In Phase 2a: ‘Otherwise, the Proposer is free to choose any value’. What does this mean? A value for what? For the proposal number?
-
Paxos seems to rely on an increasing value of N (proposal number) to work? Is this correct?
-
The wikipedia entry does not discuss the initial values a node should set before starting to participate in Paxos. What are these?
P.S.: I don’t have enough reputation to create a ‘Paxos’ tag (any volunteer?)
The nomenclature in Paxos is a little unintuitive.
Paxos requires you can get at least a quorum (5 nodes in your case). Go with your solution of three regions; having two network partitions between the three regions is very bad news. I also use a version of Paxos which can change node membership from one instance to the next. This is useful for partitions and node failure.
A naive implementation of Paxos is not guaranteed to terminate because multiple nodes can leap-frog Prepare phases. There are two ways of getting around this. One is to have a random backoff before starting new Prepare phases. The second is to route all requests to a designated leader, that acts as proposer (The leader is chosen by a Paxos instance. See also Multi-paxos)
When a node receives an Accept!(roundId, value) message from a Proposer and it hasn’t promised to not accept the value (due to a Prepare!(higherRoundId) message), it stores the value and the roundId (I’ll call them
acceptedValueandacceptedRoundId). It may write over these due to subsequent Accept!(…) messages.When a node receives a Prepare!(roundId) message from a Proposer, it stores roundId as
promiseRoundId = max(roundId, promiseRoundId). It then sends aPromise!(acceptedRoundId, acceptedValue)back to the Proposer. NB: if a node hasn’t received an Accept!(…) message, it replies withPromise!(null, null).There is no need to send it. I don’t.
The value is the actual data the algorithm is reaching consensus on. I’ll rephrase this to
To start the Accept Phase, The Proposer must choose a value to be accepted depending on the results of the Prepare phase. If any Acceptor replied with Promise(roundId, value), the Proposer must use the value associated with the highest roundId. Otherwise, the Proposer received only Promise(null, null), and may choose any value to send to the acceptors.
NB: Proposal number here is the same thing as roundId.
This is the value you want to have consensus on. This is typically a state change across the distributed system, perhaps triggered by a client request.
Round ids (aka proposal numbers) should be increasing and must be unique per instance across all nodes. The Paxos paper assumes you can do this because it is trivial to achieve. Here’s one scheme that produces the same results on all nodes:
roundId = i*M + index[node]where i is the ith round this node is starting (that is i is unique per node per paxos instance, and is monotonically increasing).Or in pseudo-code (which is clearly lacking a few major optimizations):