Consensus Algorithms and Fault-Tolerance in Distributed Systems

We know that the distributed system is capable of massive data processing and scalable computing. As a distributed system, the blockchain system needs to rely on consensus, if its multiple nodes want to reach a consistent result on a particular state.


At Trias Workshop last week, Dr. Zhilin Hu systematically presented the consensus algorithm and its fault-tolerance.

His sharing is for:

Basic problems with distributed systems

Simply put, a distributed system is a set of nodes that communicate over a network and can accomplish common tasks in a coordinated manner.

figure 1

As you can see from the right side of figure 1, the previous central topology is a central server in the middle, with nodes distributed around it in a star shape.

In a distributed system, such as a blockchain, each node has the same permissions, and there is no difference between a central server and a terminal. There are some major issues here. Will the node work? If not, it could crash.

Communication between nodes is mainly through the network, which can cause delays, instability, and outages. The third case is the most challenging problem to solve at present, that is, some malicious nodes launch an attack on the whole network, which we usually call it ‘double spend attack’.

Before solving these problems, we should model and abstract them. From the perspective of the network, there are two models of a distributed system: Synchronous Model and Synchronous Model. What’s the difference?

The first is the clock drift of each node. For the Synchronization Model, although the clock of each node is different, the difference of the clock is limited. But for the Asynchronous Model, there is no upper limit.

Second, the network transmission time. For the Synchronous Model, messages between two nodes can be guaranteed to arrive accurately, while for the Asynchronous Model, the upper limit of the delay is uncertain.

The other is the computing rate of the nodes. In the Synchronous Model, the speed of each node is almost the same, but in the Asynchronous Model, the computing speed of the nodes is unpredictable.

For the Failure Model, if only the nodes crash or the network goes down, we call it a Non-Byzantine failure.

But if there is a malicious node, it is the Byzantine fault. This kind of failure is complicated to solve, and the complexity of the algorithm and the communication performance it consumes is much greater than the non-byzantine failure.

Distributed system theory

The first is the impossibility theory of FLP, that is, under the premise of a reliable network, it is impossible to have a deterministic algorithm to solve the consistency problem in any node failure, or the consistency problem of one or more minimized asynchronous model systems.

Distributed System

The second impossible principle is the CAP principle. Especially in traditional distributed systems, we are all very familiar with this theory. For example, in practice, it is impossible to guarantee strong consistency, availability and partition tolerance in a system at the same time, but to weaken one feature to ensure the other two based on engineering characteristics.

The third is the BASE theory. It is an extension of CAP principle, like CAP theory with strong consistency; ‘BA’ stands for basic usability, ‘S’ for a soft state, and ‘E’ for final consistency.

Like Bitcoin, the consensus is the ultimate consistency. We cannot guarantee that the blocks at a particular time must be valid blocks. After each block is generated, validation takes some time. The longer the time, the more likely it is that this block will be a valid block, which is the final consistency.

Consensus algorithms for distributed systems

To solve the above problems, we proposed the consensus algorithm,

Types of Consensus Algorithms

Usually, the current consensus algorithm has the following three parts:

The first, Paxos, addresses the issue of non-Byzantine distributed asynchronous network consensus;

The second, BFT, addresses the Byzantine tradition of distributed asynchronous network consensus;

The third is POW, addresses the Byzantine blockchain distributed asynchronous network consensus problem.

The first two types are commonly used in the traditional distributed system, which can ensure the normal operation in a network, but there are generally fewer nodes in the normal working network.

For the public chain, POW is the first consensus algorithm that can support tens of thousands of nodes or even larger nodes.

Trustworthy and Reliable Intelligent Autonomous Systems