# Notes on PaxosNov 1, 2020

These are my notes after learning the Paxos algorithm. The primary goal here is to sharpen my own understanding of the algorithm, but maybe someone will find this explanation of Paxos useful! This post assumes fluency with mathematical notation.

I must confess it took me a long time to understand distributed consensus. Ive read a whole bunch of papers (Part Time Parliament, Paxos Made Simple, Practical BFT, In Search of an Understandable Consensus Algorithm, CASPaxos: Replicated State Machines without logs), but they didnt make sense. Or rather, nothing specific was unclear, but, at the same time, I was unable to answer the core question:

That means that I didnt actually understand the algorithm.

What finally made the whole thing click are

I now think that the thing is actually much simpler than it is made to believe :-)

Buckle in, we are starting!

## What is Paxos?

Paxos is an algorithm for implementing distributed consensus. Suppose you have `N` machines which communicate over a faulty network. The network may delay, reorder, and lose messages (it can not corrupt them though). Some machines might die, and might return later. Due to network delays, machine is dead and machine is temporary unreachable are indistinguishable. What we want to do is to make machines agree on some value.Agree here means that if some machine says value is X, and another machine says value is Y, then X necessary is equal to Y. It is OK for machine to answer I dont know yet.

The problem with this formulation is that Paxos is an elementary, but subtle algorithm. To understand it (at least for me), a precise, mathematical formulation is needed. So, lets try again.

What is Paxos? Paxos is a theorem about sets! This is definitely mathematical, and is true (as long as you base math on set theory), but is not that helpful. So, lets try again.

What is Paxos? Paxos is a theorem about nondeterministic state machines!

A system is characterized by a state. The system evolves in discrete steps: each step takes system from `state` to `state'`. Transitions are non-deterministic: from a single current `s1`, you may get to different next states `s2` and `s3`. (non-determinism models a flaky network). An infinite sequence of systems states is called a behavior:

Due to non-determinism, theres a potentially infinite number of possible behaviors. Nonetheless, depending on the transition function, we might be able to prove that some condition is true for any state in any behavior.

Lets start with a simple example, and also introduce some notation. I wont use TLA+, as I dont enjoy its concrete syntax. Instead, math will be set in monospaced unicode.

The example models an integer counter. Each step the counter decrements or increments (non-deterministically), but never gets too big or too small

The sate of the system is a single variable `counter`. It holds a natural number. In general, we will represent a state of any system by a fixed set of variables. Even if the system logically consists of several components, we model it using a single unified state.

The `Init` formula specifies the initial state, the `counter` is zero. Note that `=` is a mathematical equality, and not an assignment. `Init` is a predicate on states.

`Init` is true for `{counter: 0}`.
`Init` is false for `{counter: 92}`.

`Next` defines a non-deterministic transition function. It is a predicate on pairs of states, `s1` and `s2`. `counter` is a variable in the `s1` state, `counter'` is the corresponding variable in the `s2` state. In plain English, transition from `s1` to `s2` is valid if one of these is true:

• Value of `counter` in `s1` is less than `9` and value of `counter` in `s2` is greater by 1.
• Value of `counter` in `s1` is greater than `0`, and value of `counter` in `s2` is smaller by 1.

`Next` is true for `({counter: 5}, {counter: 6})`.
`Next` is false for `({counter: 5}, {counter: 5})`.

Here are some behaviors of this system:

• `0 → 1 → 2 → 3 → 4 → 5 → 6 → 7 → 8 → 9`
• `0 → 1 → 0 → 1 → 0 → 1`
• `0 → 1 → 2 → 3 → 2 → 1 → 0`

Here are some non behaviors of this system:

• `1 → 2 → 3 → 4 → 5`: `Init` does not hold for initial state
• `0 → 2`: `Next` does not hold for `(0, 2)` pair
• `0 → 1 → 0 → -1`: `Next` does not hold for `(0, -1)` pair

behavior means that the initial state satisfies `Init`, and each transition satisfies `Next`.

We can state and prove a theorem about this system: for every state in every behavior, the value of counter is between 0 and 9. Proof is by induction:

• The condition is true in the initial state.
• If the condition is true for state `s1`, and `Next` holds for `(s1, s2)`, then the condition is true for `s2`.
• QED.

As usual with induction, sometimes we would want to prove a stronger property, because it gives us more powerful base for an induction step.

To sum up, we define a non-deterministic state machine using two predicates `Init` and `Next`. `Init` is a predicate on states which restricts possible initial states. `Next` is a predicate on pairs of states, which defines a non-deterministic transition function. `Vars` section describes the state as a fixed set of typed variables. `Sets` defines auxiliary fixed sets, elements of which are values of variables. `Theorem` section specifies a predicate on behaviors: sequences of steps evolving according to `Init` and `Next`.

The theorem does not automatically follow from `Init` and `Next`, it needs to be proven. Alternatively, we can simulate a range of possible behaviors on a computer and check the theorem for the specific cases. If the set of reachable states is small enough (finite would be a good start), we can enumerate all behaviors and produce a brute force proof. If there are too many reachable states, we cant prove the theorem this way, but we often can prove it to be wrong, by finding a counter example. This is the idea behind model checking in general and TLA+ specifically.

## What is Consensus?

Having mastered the basic vocabulary, lets start slowly building towards Paxos. We begin with defining what consensus is. As this is math, well do it using sets.

The state of the system is a set of chosen values. For this set to constitute consensus (over time) we need two conditions to hold:

• at most one value is chosen
• if we choose a value at one point in time, we stick to it (math friendly: any two chosen values are equal to each other)

Heres the simplest possible implementation of consensus:

In the initial state, the set of chosen values is empty. We can make a step if the current set of chosen values is empty, in which case we select an arbitrary value.

This technically breaks our behavior theory: we require behaviors to be infinite, but, for this spec, we can only make a single step. The fix is to allow empty steps: a step which does not change the state at all is always valid. We call such steps stuttering steps.

The proof of the first condition of the consensus theorem is a trivial induction. The proof of the second part is actually non-trivial, heres a sketch. Assume that `i` and `j` are indices, which violate the condition. They might be far from each other in state-space, so we cant immediately apply `Next`. So lets choose the smallest `j1 ∈ [i+1;j]` such that the condition is violated. Let `i1 = j1 - 1`. The condition is still violated for `(i1, j1)` pair, but this time they are subsequent steps, and we can show that `Next` does not hold for them, concluding the proof.

Yay! We have a distributed consensus algorithm which works for 1 (one) machine:

## Simple Voting

Lets try to extend this to a truly distributed case, where we have `N` machines (acceptors). We start with formalizing the naive consensus algorithm: let acceptors vote for values, and select the value which gets a majority of votes.

The state of the system is the set of all votes cast by all acceptors. We represent a vote as a pair of an acceptor and the value it voted for. Initially, the set of votes is empty. On each step, some acceptor casts a vote for some value (adds `(a, v)` pair to the set of votes), but only if it hasnt voted yet. Remember that `Next` is a predicate on pairs of states, so we check `votes` for existing vote, but add a new one to `votes'`. The value is chosen if the set of acceptors which voted for the value (`{a ∈ 𝔸: (a, v) ∈ votes}`) is at least half as large as the set of all acceptors. In other words, if a majority of acceptors has voted for the value.

Lets prove consensus theorem for Majority Vote protocol. TYPE ERROR, DOES NOT COMPUTE. The consensus theorem is a predicate on behaviors of states consisting of `chosen` variable. Here, `chosen` isnt a variable, `votes` is! `chosen` is a function which maps current state to some boolean.

While it is intuitively clear what consensus theorem would look like for this case, lets make this precise. Lets map states with `votes` variable to states with `chosen` variable using the majority rule, `f`. This mapping naturally extends to a mapping between corresponding behaviors (sequences of steps):

Now we can precisely state that for every behavior `B` of majority voting spec, the theorem holds for `f(B)`. This yields a better way to prove this! Instead of proving the theorem directly (which would again require i1, j1 trick), we prove that our mapping `f` is a homomorphism. That is, we prove that if `votes_0 → votes_1 → ...` is a behavior of the majority voting spec, then `f(votes_0) → f(votes_1) → ...` is a behavior of the consensus spec. This lets us to re-use existing proof.

The poof for initial step is trivial, but lets spell it out just to appreciate the amount of details a human mind can glance through

Lets show that if Majority Votes `Next_m` holds for `(votes, votes')`, then Consensuss `Next_c` holds for `(f(votes), f(votes'))`. Theres one obstacle on our way: this claim is false! Consider a case with three acceptors and two values: `𝔸 = {a1, a2, a3}`, `𝕍 = {v1, v2}`. Consider these values of `votes` and `votes'`:

If you just mechanically check `Next`, you see that it works! `a3` hasnt cast its vote, so it can do this now. The problem is that `chosen(votes) = {v1}` and `chosen(votes') = {v1, v2}`.

We are trying to prove too much! `f` works correctly only for states reachable from `Init`, and the bad value of `votes` where `a1` votes twice is not reachable.

So, we first should prove a lemma: each acceptor votes at most once. After that, we can prove `Next_m(votes, votes') = Next_c(f(votes), f(votes'))` under the assumption of at most once voting. Specifically, if `|f(votes')|` turns out to be larger than `1`, then we can pick two majorities which voted for different values, which allows to pin down a single acceptor which voted twice, which is a contradiction. The rest is left as an exercise for the reader :)

So, majority vote indeed implements consensus. Lets look closer at the majority condition. It is clearly important. If we define `chosen` as

then its easy to construct a behavior with several chosen values. The property of majority we use is that any two majorities have at least one acceptor in common. But any other condition with this property would work as well as majority. For example, we can assign an integer weight to each acceptor, and require the sum of weights to be more than half. As a more specific example, consider a set of for acceptors `{a, b, c, d}`.

Its majorities are:

But the following set of sets would also satisfy non-empty intersection condition:

Operationally, it is strictly better, as fewer are acceptors needed to reach a decision.

So lets refine the protocol to a more general form.

We require to specify a set of quorums set a of subsets of acceptors such that every two quorums have at least one acceptor in common. The value is chosen if there exists a quorum such that its every member voted for the value.

Theres one curious thing worth noting here. Consensus is a property of the whole system, theres no single place where we can point to and say hey, this is it, this is consensus. Imagine 3 acceptors, sitting on Earth, Venus, and Mars, and choosing between values `v1` and `v2`. They can execute Quorum Vote algorithm without communicating with each other at all. They will necessary reach consensus without knowing which specific value they agreed on! An external observer can then travel to the three planets, collect the votes and discover the chosen value, but this feature isnt built into the algorithm itself.

OK, so weve just described an algorithm for finding consensus among `N` machines, proved the consensus theorem for it, and noted that it has staggering communication efficiency: zero messages. Should we collect our Turing Award?

Well, no, theres a big problem with Quorum Vote it can get stuck. Specifically, if there are three values, and the votes are evenly split between them, then no value is chosen, and only stuttering steps are possible. If you can vote for different values, it might happen that neither value receives a majority of votes. Voting satisfies the safety property, but not the liveness property the algorithm can get stuck even if all machines are on-line and communication is perfect.

There is a simple fix to the problem, with a rich historical tradition among many democratic governments. Lets have a vote, and lets pick the value chosen by the majority, but lets allow to vote only for a single candidate value:

The new condition says that an acceptor is only allowed to cast a vote if all other votes are for the same value. As a special case, if the set of votes is empty, the acceptor can vote for any value (but all other acceptors would have to vote for this value afterwards).

From a mathematical point of view, this algorithm is perfect. From a practical stand point, not so much: an acceptor to cast the first vote somehow needs to make sure that it is indeed the first one. The obvious fix to this problem is to assign a unique integer number to each acceptor, call the highest-numbered acceptor leader, and allow only the leader to cast the first decisive vote.

So acceptors first communicate with each other to figure out who the leader is, then the leader casts the vote, and the followers follow. But this also violates liveness: if the leader dies, then the followers would wait indefinitely. A fix for this problem is to let the second highest acceptor to take over the leadership if the leader perishes. But under our assumptions, its impossible to distinguish between a situation when the leader is dead from a situation when it just has a really bad internet connection. So naively picking successor would lead to a split vote and a standstill again (power transitions are known to be problematic for authoritarian regimes in real life too!). If only there were some kind of distributed consensus algorithm for picking the leader!

## Ballot Voting

This is the place were we start discussing real Paxos :-) It starts with a ballot voting algorithm. This algorithm, just like the ones weve already seen, does not define any messages. Rather, message passing is an implementation detail, so well get to it later.

Recall that rigged voting requires all acceptors to vote for a single values. It is immune to split voting, but is susceptible to getting stuck when the leader goes offline. The idea behind ballot voting is to have many voting rounds, ballots. In each ballot, acceptors can vote only for a single value, so each ballot individually can get stuck. However, as we are running many ballots, some ballots will make progress. The value is chosen in a ballot if it is chosen by some quorum of acceptors. The value is chosen in an overall algorithm if it is chosen in some ballot.

The Turing award question is: how do we make sure that no two ballots choose different values? Note that it is OK if two ballots choose the same value.

Lets just brute force this question, really. First, assume that the ballots are ordered (for example, by numbering them with natural numbers). And lets say we want to pick some value `v` to vote for in ballot `b`. When `v` is safe? Well, when no other value `v1` can be chosen by any other ballot. Lets tighten this up a bit.

Value `v` is safe at ballot `b` if any smaller ballot `b1` (`b1 < b`) did not choose and will not choose any value other than `v`.

So yeah, easy-peasy, we just need to predict which values will be chosen in the future, and we are done! Well deal with it in a moment, but lets first convince ourselves that, if we only select safe values for voting, we wont violate consensus spec.

So, when we select a safe value `v` to vote for in a particular ballot, it might get chosen in this ballot. We need to check that it wont conflict with any other value. For smaller ballots thats easy its the definition of safety condition. What if we conflict with some value `v1` chosen in a future ballot? Well, that value is also safe, so whoever chose `v1`, was sure that it wont conflict with `v`.

How do we tackle the precognition problem? Well ask acceptors to commit to not voting in certain ballots. For example, if you are looking for a safe value for ballot `b` and know that theres a quorum `q` such that each quorum member never voted in smaller ballots, and promised to never vote in smaller ballots, you can be sure that any value is safe. Indeed, any quorum in smaller ballots will have at least one member which would refuse to vote for any value.

Ok, but what if theres some quorum member which has already voted for some `v1` in some ballot `b1 < b`? (Take a deep breath, the next sentence is the kernel of the core idea of Paxos). Well, that means that `v1` was safe at `b1`, so, if there will be no votes between `b1` and `b`, `v1` is also safe at `b`! (Exhale). In other words, to pick a safe value at `b` we:

1. Take some quorum `q`.
2. Make everyone in `q` promise to never vote in ballots earlier than `b`.
3. Among all of the votes already cast by the quorum members we pick the one with the highest ballot number.
4. If such vote exists, its value is a safe value.
5. Otherwise, any value is safe.

To implement the never vote promise, each acceptor will maintain `maxBal` value. It will never vote in ballots smaller or equal to `maxBal`.

Lets stop hand-waving and put this algorithm in math. Again, we are not thinking about messages yet, and just assume that each acceptor can observe the state of the whole system.

Lets unwrap this top-down. First, the `chosen` condition says that it is enough for some quorum to cast votes in some ballot for a value to be accepted. Its trivial to see that, if we fix the ballot, then any two quorums would vote for the same value quorums intersect. Showing that quorums vote for the same value in different ballots is the tricky bit.

The `Init` condition is simple no votes, any acceptor can vote in any ballot (= any ballot with number larger than -1).

The `Next` consists of two cases. On each step of the protocol, some acceptor either votes for some value in some ballot `∃ v ∈ 𝕍: Vote(a, b, v)`, or declares that it wont cast additional vote in small ballots `AdvanceMaxBal(a, b)`. Advancing ballot just sets `maxBal` for this acceptor (but takes care not to rewind older decisions). Casting a vote is more complicated and is predicated on three conditions:

• We havent forfeited our right to vote in this ballot.
• If theres some vote in this ballot already, we are voting for the same value.
• If there are no votes, then the value should be safe.

Note that the last two checks overlap a bit: if the set of votes cast in a ballot is not empty, we immediately know that the value is safe: somebody has proven this before. But it doesnt harm to check for safety again: a safe value can not become unsafe.

Finally, the safety check. It is done in relation to some quorum if `q` proves that `v` is safe, than members of this quorum would prevent any other value to be accepted in early ballots. To be able to do this, we first need to make sure that `q` indeed finalized their votes for ballots less than `b` (`maxBall` is at least `b - 1`). Then, we need to find the latest vote of `q`. There are two cases

• No one in `q` ever voted (`b1 = -1`). In this case, there are no additional conditions on `v`, any value would work.
• Someone in `q` voted, and `b1` is the last ballot when someone voted. Then `v` must be the value voted for in `b1`. This implies `Safe(v, b1)`.

If all of these conditions are fulfilled, we cast our vote and advance `maxBall`.

This is the hardest part of the article. Take time to fully understand Ballot Vote.

Rigorously proving that Ballot Voting satisfies Consensus would be tedious the specification is large, and the proof would necessary use every single piece of the spec! But lets add some hand-waving. Again, we want to provide homomorphism from Ballot Voting to Consensus. Cases where the image of a step is a stuttering step (the set of chosen values is the same) are obvious. Its also obvious that the set of chosen values never decreases (we never remove votes, so a value can not become unchosen). It also increases by at most one value with each step.

The complex case is to prove that, if currently only `v1` is chosen, no other `v2` can be chosen as a result of the current step. Suppose the contrary, let `v2` be the newly chosen value, and `v1` be a different value chosen some time ago. `v1` and `v2` cant belong to the same ballot, because every ballot contains votes only for a single value (this needs proof!). Lets say they belong to `b1` and `b2`, and that `b1 < b2`. Note that `v2` might belong to `b1` nothing prevents smaller ballot from finishing later. When we chose `v2` for `b2`, it was safe. This means that some quorum either promised not to vote in `b1` (but then `v1` couldnt have been chosen in `b1`), or someone from the quorum voted for `v2` in `b1` (but then `v1 = v2` (proving this might require repeated application of safety condition)).

Ok, but is this better than Majority Voting? Can Ballot Voting get stuck? No if at least one quorum of machines is online, they can bump their `maxBall` to a ballot bigger than any existing one. After they do this, there necessary will be a safe value relative to this quorum, which they can then vote on.

However, Ballot Voting is prone to a live lock if acceptors continue to bump `maxBal` instead of voting, theyll never select any value. In fact, in the current formulation one needs to be pretty lucky to not get stuck. To finish voting, there needs to be a quorum which can vote in ballot `b`, but not in any smaller ballot, and in the above spec this can only happen by luck.

It is impossible to completely eliminate live locks without assumptions about real time. However, when we implement Ballot Voting with real message passing, we try to reduce the probability of a live lock.

## Paxos for Real

One final push left! Given the specification of Ballot Voting, how do we implement it using message passing? Specifically, how do we implement the logic for selecting the first (safe) value for the ballot?

The idea is to have a designated leader for each ballot. As there are many ballots, we dont need a leader selection algorithm, and can just statically assign ballot leaders. For example, if there are N acceptors, acceptor 0 can lead ballots `0, N, 2N, …`, acceptor 1 can lead `1, N + 1, 2N + 1, …` etc.

To select a value for ballot `b`, the ballots leader broadcasts a message to initiate the ballot. Upon receiving this message, each acceptor advances its `maxBall` to `b - 1`, and sends the leader its latest vote, unless the acceptor has already made a promise to not vote in `b`. If the leader receives replies from some quorum, it can be sure that this quorum wont vote in smaller ballots. Besides, the leader knows quorums votes, so it can pick a safe value.

In other words, the practical trick for picking a safe value is to ask some quorum to abstain from voting in small ballots and to pick a value consistent with votes already cast. This is the first phase of Paxos, consisting of two message types, 1a and 1b.

The second phase is to ask the quorum to cast the votes. The leader picks a safe value and broadcasts it for the quorum. Quorum members vote for the value, unless in the meantime they happened to promise to a leader of the bigger ballot to not vote. After a member voted, it broadcasts its vote. When a quorum of votes is observed, the value is chosen and the consensus is reached. This is the second phase of Paxos with messages 2a and 2b.

Lets write this in math! To model message passing, we will use `msgs` variable: a set of messages which have ever been send. Sending a message is adding it to this set. Receiving a message is asserting that it is contained in the set. By not removing messages, we model reorderings and duplications.

The messages themselves will be represented by records. For example, phase 1a message which initiates voting in ballot `b` will look like this:

Another bit of state well need is `lastVote` for each acceptor, what was the last ballot the acceptor voted in, together with the corresponding vote. It will be `null` if the acceptor hasnt voted.

Lets go through each of the phases.

`Phase1a` initiates ballot `b`. It is executed by the ballots leader, but theres no need to model who exactly the leader is, as long as it is unique. This stage simply broadcasts 1a message.

`Phase1b` is executed by an acceptor `a`. If `a` receives `1a` message for ballot `b` and it can vote in `b`, then it replies with its `lastVote`. If it cant vote (it has already started some larger ballot), it simply doesnt respond. If enough acceptors dont respond, the ballot will get stuck, but some other ballot might succeed.

`Phase2a` is the tricky bit, it checks if the value `v` is save for ballot `b`.

First, we need to make sure that we havent already initiated `Phase2a` for this ballot. Otherwise, we might initiate `Phase2a` for different values. Here is the bit where it is important that the ballots leader is stable. The leader needs to remember if it has already picked a safe value.

Then, we collect 1b messages from some quorum (we need to make sure that every quorum member has send 1b message for this ballot). Value `v` is safe if the whole quorum didnt vote (`vote` is null), or if it is the value of the latest vote of some quorum member. We know that quorum members wont vote in earlier ballots, because they had increased `maxBal` before sending 1b messages.

If the value indeed turns out to be safe, we broadcast 2a message for this ballot and value.

Finally, in `Phase2b` an acceptor `a` votes for this value, if its `maxBall` is still good. The bookkeeping is updating `maxBal`, `lastVote`, and sending the 2b message.

The set of 2b messages corresponds to the `votes` variable of the Ballot Voting specification.

## Notes on Notes

Theres a famous result called FLP impossibility: Impossibility of Distributed Consensus with One Faulty Process. But weve just presented Paxos algorithm, which works as long as more than half of the processes are alive. What gives? FLP theorem states that theres no consensus algorithm with finite behaviors. Stated in a positive way, any asynchronous distributed consensus algorithm is prone to live-lock. This is indeed the case for Paxos.

Liveness can be improved under partial synchronity assumptions. Ie, if we give each process a good enough clock, such that we can say things like if no process fails, Paxos completes in `t` seconds. If this is the case, we can fix live locking (ballots conflicting each other) by using naive leader selection algorithm to select the single acceptor which can initiate ballots. If we dont reach consensus after `t` seconds, we can infer that someone has failed and re-run naive leader selection. If we are unlucky, naive leader selection will produce two leaders, but this wont be a problem for safety.

Paxos requires atomicity and durability to function correctly. For example, once the has leader picked safe value and has broadcasted a 2a message, it should persist the selected value. Otherwise, if it goes down and then resurrects, it might choose a different value. How to make a choice of value atomic and durable? Write it to a local database! How to make local transaction atomic and durable? Write it first into the write ahead log? How to write something to WAL? Using the `write` syscall/DMA. What happens if the power goes down exactly in the middle of the write operation? Well, we can write a chunk of bytes with a checksum! Even if the write itself is not atomic, a checksummed write is! If we read the record from disk and checksum matches, then the record is valid.

I use slightly different definition of `maxBal` (less by one) than the one in the linked lecture, dont get confused about this!