Chenyo's org-static-blog

04 Oct 2024

Learning the DFINITY P2P layer

This is a personal note for learning DFINITY Internet Computer (IC) P2P layer.

The learning resources include:

1. Terminology

1.1. IC design principle

  • Secure, reliable and scalable.
  • Scalability relies on efficient network message distribution; hence the IC network is divided into subnets.

1.2. IC protocol stack

  • From top to bottom: execution, message routing, consensus, P2P.

1.2.1. Execution layer

  • Manages a safe environment for the deterministic execution of canister messages.

1.2.2. Message routing layer

  • Routes user and system-generated messages between subnets.
  • Manages the input and output queues for applications.
  • Schedules messages for execution.

1.2.3. Consensus layer

  • Arranges messages received from users and different subnets into blocks before delivering them to the message routing layer.

1.2.4. P2P layer

  • Collects and advertises messages from users and other nodes in the same subnet on behalf of higher-layer protocols, e.g., the IC consensus protocol and the state sync protocol.
  • Main challenges: security, performance and scalability.

1.3. State sync

  • Enables nodes to synchronize the replicated state of the subnet by downloading the required state and verifying its authenticity against the subnet’s chain key with the state sync protocol.
  • Up-to-date nodes regularly create a checkpoint by writing the replicated state to disk and computing the state hash; the consensus layer then tries to agree on the state by including the hash in a catch-up package (CUP).
    • A state recorded in a CUP implies that a majority of nodes agree on the state and that a majority of nodes can serve it.

1.4. QUIC protocol

  • A new transport protocol with TLS 1.3, a new handshake, a new packet structure, encryption, uni/bi-directional streams, etc.
  • Low latency: zero round-trip (0-RTT) connection setup for recently visited sites, and no head-of-line (HOL) blocking.
    • HOL blocking: the delivery of subsequent packets in a data stream is delayed because an earlier packet must first be processed or retransmitted.
  • Encrypted transport: most of the packet header is encrypted; unencrypted connection IDs are used to route packets.
    • Connection IDs also allow migrated connections to be tracked securely.
    • Packet numbers are also encrypted to avoid cross-network correlation.
  • Packetization: supports multiple stream frames in a single packet (a sketch follows this list).
    • Allow unaffected streams to continue processing even if one stream has packet loss.
    • Enable multiplexing of different data streams over a single connection.
  • Stateless design: each QUIC packet contains enough information to be processed independently, and the protocol has a built-in mechanism to match responses to requests.
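
A minimal Rust sketch of the packetization idea above: frames from several independent streams coalesced into one packet, so losing the packet only stalls the streams whose frames it carried. The types are simplified illustrations, not QUIC’s actual wire format.

  // Illustrative QUIC-style packetization: multiple stream frames in one packet.
  #[derive(Debug)]
  struct StreamFrame {
      stream_id: u64, // which logical stream this frame belongs to
      offset: u64,    // byte offset of the data within that stream
      data: Vec<u8>,
  }

  #[derive(Debug)]
  struct Packet {
      packet_number: u64, // encrypted on the wire to avoid cross-network correlation
      frames: Vec<StreamFrame>,
  }

  /// Greedily packs pending frames from different streams into one packet,
  /// up to a maximum payload size. Loss of this packet only stalls the
  /// streams whose frames it carried; other streams keep progressing.
  fn packetize(pending: &mut Vec<StreamFrame>, packet_number: u64, max_payload: usize) -> Packet {
      let mut frames = Vec::new();
      let mut used = 0;
      while let Some(frame) = pending.first() {
          if used + frame.data.len() > max_payload {
              break;
          }
          used += frame.data.len();
          frames.push(pending.remove(0));
      }
      Packet { packet_number, frames }
  }

  fn main() {
      let mut pending = vec![
          StreamFrame { stream_id: 0, offset: 0, data: b"consensus block".to_vec() },
          StreamFrame { stream_id: 4, offset: 0, data: b"state sync chunk".to_vec() },
      ];
      let pkt = packetize(&mut pending, 1, 1200);
      println!("{pkt:?}"); // one packet carrying frames of two independent streams
  }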

1.5. Backpressure in blockchain

  • If the receiver slows down message consumption, the sender’s buffer fills up and the sender’s networking layer must take one of three paths (sketched below):
    • Propagate the backpressure to the application layer to slow down data production; this would be a DoS attack vector.
    • Buffer messages (indefinitely); also an attack vector.
    • Drop egress messages; this risks liveness, i.e., there is no delivery guarantee.
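
A small Rust sketch of this trade-off, using a bounded std channel as the sender’s TX queue; the drop path is implemented, the other two options are noted in comments. All names are illustrative.

  use std::sync::mpsc::{sync_channel, TrySendError};

  fn main() {
      // A bounded TX queue holding at most 2 messages.
      let (tx, rx) = sync_channel::<String>(2);

      for i in 0..5 {
          match tx.try_send(format!("artifact {i}")) {
              Ok(()) => println!("queued artifact {i}"),
              // Queue full. Blocking here would propagate backpressure to the
              // application (a DoS vector); growing the buffer instead would be
              // unbounded (also a DoS vector); this sketch drops the message,
              // trading the delivery guarantee for boundedness.
              Err(TrySendError::Full(msg)) => println!("dropped {msg}"),
              Err(TrySendError::Disconnected(_)) => break,
          }
      }

      while let Ok(msg) = rx.try_recv() {
          println!("receiver consumed {msg}");
      }
  }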

2. P2P layer

  • Sends out artifacts created by the layers above.
  • Receives, validates and distributes artifacts from users and other nodes.
  • Guarantees secure, eventual broadcast delivery to all nodes.
    • Under the asynchronous network assumption there is no upper bound on delivery time.
    • Under stronger network assumptions, delivery time is bounded.

2.1. Gossip protocol

  • When a subnet node creates an artifact or receives one from a peer, it gossips the artifact to all its peers, i.e., the other subnet nodes it is connected to.
    • Each artifact eventually propagates through the whole subnet.

2.2. Artifacts

  • Network messages to be broadcast in the subnet, e.g.,
    • User input to canister smart contracts;
    • Protocol-originating messages, e.g., blocks produced by the consensus layer or state synchronization certification.

2.3. Constraints

  • The P2P layer should provide the above guarantees under the following constraints:
    • Bounded-time delivery, or eventual delivery under weaker assumptions.
    • Tolerate up to a certain threshold (1/3) of dropped artifacts or Byzantine nodes.
    • Each peer has its own share of the resources available at other peers, and the resource usage by each peer is bounded.
    • Should be able to prioritize different artifacts.
    • Provide high throughput (rather than low latency) and avoid data duplication to utilize the network efficiently.
    • Be resilient to DoS/spam and provide encryption, authenticity and integrity.

2.4. Adverts

  • Simply flooding all artifacts would consume unnecessary network bandwidth; instead, artifact previews called adverts are sent first.
  • An advert includes fields used by the gossip protocol and its application components (which process the messages) for integrity verification and decision-making (e.g., which artifacts to prioritize).
  • Other nodes may then request the corresponding artifact from one or more of the peers that sent the advert.
  • In the new P2P layer, if the artifact is small enough (< 1KB), adverts are not used.

2.4.1. Advert priority

  • Consensus provides the gossip protocol with a priority function, which takes an advert and its attributes and returns a priority value.
  • Each P2P client decides how to handle the artifact based on this value (e.g., drop it or fetch it immediately); a sketch follows.
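
A hypothetical Rust sketch of such a priority function; the variants and the height-based policy are illustrative assumptions, not the IC’s actual API.

  // Hypothetical priority values returned to the gossip protocol.
  #[derive(Debug, PartialEq)]
  enum Priority {
      FetchNow, // request the artifact immediately
      Fetch,    // request when bandwidth allows
      Stash,    // keep the advert, decide later
      Drop,     // ignore the advert entirely
  }

  // A simplified advert carrying only the attribute the policy looks at.
  struct Advert {
      height: u64, // e.g., the block height the artifact belongs to
  }

  /// Decide a priority from the advert's attributes relative to local state.
  fn priority(advert: &Advert, finalized_height: u64) -> Priority {
      if advert.height <= finalized_height {
          Priority::Drop // already finalized, no longer useful
      } else if advert.height == finalized_height + 1 {
          Priority::FetchNow // needed to make progress right away
      } else if advert.height < finalized_height + 50 {
          Priority::Fetch
      } else {
          Priority::Stash // too far ahead to act on yet
      }
  }

  fn main() {
      let advert = Advert { height: 11 };
      assert_eq!(priority(&advert, 10), Priority::FetchNow);
  }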

2.5. Artifact pool

  • A data structure maintained by the gossip protocol.
  • Contains all available artifacts for each application component.
  • P2P informs consensus and other client components about pool changes by calling on_state_change(); each component then determines its next action, e.g., validation (sketched after this list).
    • Each call returns a set of ChangeActions corresponding to the addition and deletion of artifacts from the client’s validated pool; the corresponding adverts are then broadcast.
  • Artifacts can be persisted to non-volatile storage.
  • The pool is separated into validated and unvalidated sections; the size of each peer’s unvalidated section is bounded to prevent bad peers from filling up the pool.
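
A minimal Rust sketch of this interaction; the signatures of on_state_change() and the ChangeAction variants are simplified assumptions, not the IC’s real interfaces.

  use std::collections::HashMap;

  type ArtifactId = u64;

  // Assumed variants; the real set of change actions is richer.
  enum ChangeAction {
      MoveToValidated(ArtifactId),       // artifact passed validation
      RemoveFromUnvalidated(ArtifactId), // artifact failed validation
  }

  #[derive(Default)]
  struct ArtifactPool {
      validated: HashMap<ArtifactId, Vec<u8>>,
      unvalidated: HashMap<ArtifactId, Vec<u8>>,
  }

  trait Client {
      /// Called by P2P whenever the pool changes; returns the changes to apply.
      fn on_state_change(&self, pool: &ArtifactPool) -> Vec<ChangeAction>;
  }

  fn apply(pool: &mut ArtifactPool, actions: Vec<ChangeAction>) {
      for action in actions {
          match action {
              ChangeAction::MoveToValidated(id) => {
                  if let Some(artifact) = pool.unvalidated.remove(&id) {
                      pool.validated.insert(id, artifact);
                      // ... broadcast the corresponding advert to all peers here.
                  }
              }
              ChangeAction::RemoveFromUnvalidated(id) => {
                  pool.unvalidated.remove(&id);
              }
          }
      }
  }

  struct DummyClient;

  impl Client for DummyClient {
      fn on_state_change(&self, pool: &ArtifactPool) -> Vec<ChangeAction> {
          // Toy policy: treat every unvalidated artifact as valid.
          pool.unvalidated.keys().map(|&id| ChangeAction::MoveToValidated(id)).collect()
      }
  }

  fn main() {
      let mut pool = ArtifactPool::default();
      pool.unvalidated.insert(7, b"block".to_vec());
      let actions = DummyClient.on_state_change(&pool);
      apply(&mut pool, actions);
      assert!(pool.validated.contains_key(&7));
  }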

2.6. Peer context

  • Advert queue: a priority queue of all adverts received from the peer, ordered by priority.
  • Requested set: contains all adverts whose artifacts have been requested from the peer.
  • Received check cache: used to prevent duplicate artifact requests. A type sketch of the peer context follows.
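
A plausible Rust shape for this per-peer state; the concrete types (binary heap, set, timestamp map) are assumptions.

  use std::collections::{BinaryHeap, HashMap, HashSet};
  use std::time::Instant;

  type ArtifactHash = [u8; 32];

  // Ordered by priority first (derived Ord compares fields in order).
  #[derive(PartialEq, Eq, PartialOrd, Ord)]
  struct PrioritizedAdvert {
      priority: u32,
      hash: ArtifactHash,
  }

  struct PeerContext {
      /// Adverts received from this peer, ordered by priority.
      advert_queue: BinaryHeap<PrioritizedAdvert>,
      /// Adverts whose artifacts have been requested from this peer.
      requested: HashSet<ArtifactHash>,
      /// Recently received artifact hashes and when they arrived; used to
      /// ignore duplicate adverts for a grace period.
      received_check_cache: HashMap<ArtifactHash, Instant>,
  }

  fn main() {
      let ctx = PeerContext {
          advert_queue: BinaryHeap::new(),
          requested: HashSet::new(),
          received_check_cache: HashMap::new(),
      };
      assert!(ctx.advert_queue.is_empty() && ctx.requested.is_empty());
      assert!(ctx.received_check_cache.is_empty());
  }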

2.7. Gossip protocol events

  • Create a new advert for an artifact added locally by an application component and send it to all peers;
  • Process a new advert.
  • Process a new artifact.
  • Handle recovery and reconnection.

2.7.1. Process a new advert received from peer \(i\)

  • If the advert is already in peer \(i\)’s received check cache, or its priority is “drop”, ignore it;
  • If the advert is not in the pool, add it to peer \(i\)’s advert queue;
  • If there is enough space in peer \(i\)’s unvalidated section of the artifact pool, call download_next(i) (with a timeout) to request the artifact with the highest priority (which does not necessarily correspond to the last received advert).
  • download_next(i) sends the request to peer \(i\) and moves the highest-priority advert from the advert queue to peer \(i\)’s requested set;
  • If download_next(i) times out, check whether the advert was also received in other peers’ advert queues and try to fetch the artifact from them before retrying with peer \(i\) (as peer \(i\) may be misbehaving). A condensed sketch follows.
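
A condensed Rust sketch of this flow for a single peer; all names and types are simplified assumptions, and the network call itself is reduced to a comment.

  use std::collections::{BTreeSet, HashSet};

  type Hash = u64;

  #[derive(Default)]
  struct PeerState {
      advert_queue: BTreeSet<(u32, Hash)>, // (priority, artifact hash), highest last
      requested: HashSet<Hash>,
      received_check_cache: HashSet<Hash>,
  }

  impl PeerState {
      fn on_advert(&mut self, hash: Hash, priority: u32, unvalidated_has_space: bool) {
          // Ignore duplicates within the grace period; priority 0 plays the
          // role of the "drop" priority here.
          if self.received_check_cache.contains(&hash) || priority == 0 {
              return;
          }
          self.advert_queue.insert((priority, hash));
          if unvalidated_has_space {
              self.download_next();
          }
      }

      fn download_next(&mut self) {
          // Pop the highest-priority advert (not necessarily the newest one)
          // and move it to the requested set.
          if let Some((_, hash)) = self.advert_queue.pop_last() {
              self.requested.insert(hash);
              // ... send the artifact request to the peer here, with a timeout.
              // On timeout, try other peers that advertised the same artifact
              // before retrying this one, as it may be misbehaving.
          }
      }
  }

  fn main() {
      let mut peer = PeerState::default();
      peer.on_advert(42, 3, true);
      assert!(peer.requested.contains(&42));
  }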

2.7.2. Process a new received artifact from peer \(i\)

  • First check that the received artifact has been requested, and verify its integrity;
  • Remove all corresponding adverts from all advert queues and requested sets;
  • Add the artifact to peer \(i\)’s unvalidated section and wait for the client component to validate it;
  • Add the artifact hash to peer \(i\)’s received check cache for a grace period, so that further adverts for the same artifact are ignored;
    • If the artifact is later removed from the unvalidated section (e.g., because it is invalid), the application component may request it again from peer \(i\); the received check cache ensures the gossip protocol does not send the request again within a short time.
  • If there is enough space in peer \(i\)’s unvalidated section, call download_next(i). A matching sketch follows.
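
A matching Rust sketch of the artifact-receipt path; as before, all names are illustrative, and the integrity check and cross-peer advert cleanup are reduced to comments.

  use std::collections::{HashMap, HashSet};
  use std::time::Instant;

  type PeerId = u64;
  type Hash = u64;

  #[derive(Default)]
  struct Node {
      requested: HashMap<PeerId, HashSet<Hash>>,
      unvalidated: HashMap<PeerId, Vec<(Hash, Vec<u8>)>>,
      received_check_cache: HashMap<PeerId, HashMap<Hash, Instant>>,
  }

  impl Node {
      fn on_artifact(&mut self, peer: PeerId, hash: Hash, artifact: Vec<u8>) {
          // Only accept artifacts that were actually requested from this peer;
          // integrity verification against the advert's hash is omitted here.
          if !self.requested.entry(peer).or_default().remove(&hash) {
              return;
          }
          // A full implementation would also remove the matching adverts from
          // *all* peers' advert queues and requested sets at this point.

          // Park the artifact in this peer's unvalidated section for the
          // client component to validate.
          self.unvalidated.entry(peer).or_default().push((hash, artifact));

          // Remember the hash for a grace period so repeated adverts are
          // ignored, even if validation later rejects the artifact.
          self.received_check_cache
              .entry(peer)
              .or_default()
              .insert(hash, Instant::now());

          // If the unvalidated section still has room: download_next(peer).
      }
  }

  fn main() {
      let mut node = Node::default();
      node.requested.entry(1).or_default().insert(42);
      node.on_artifact(1, 42, b"artifact bytes".to_vec());
      assert_eq!(node.unvalidated[&1].len(), 1);
  }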

2.8. Common attacks

2.8.1. Sybil attack

  • An attacker creates multiple identities (nodes) to gain disproportionate influence.

2.8.2. Eclipse attack

  • All peers of a correct node are Byzantine nodes that disconnect the correct node from the subnet (though they cannot send spoofed artifacts, due to artifact authentication).
  • Mitigation: use overlays with large min cut and expansion so that at least one peer of each node is correct.
    • Each node uses a different overlay.
    • Small subnets can use a complete graph; large subnets use sparser overlays.

2.8.3. Routing table poisoning

  • Malicious nodes provide false routing information to disrupt network topology.

2.8.4. Bandwidth hogging

  • Attackers consume excessive network resources.

2.9. Security

  • The NNS manages subnet membership; each node only requests and accepts connections from nodes in the same subnet, to prevent DoS attacks.
  • NNS canisters guarantee that all communication between two nodes is encrypted and authenticated with TLS.

3. Transport component

  • Sits below the P2P component and maintains the actual network connections between peers.
  • Has its own send buffers and an internal heartbeat mechanism, both important for bounded-time delivery.
  • Frames gossip messages with its own headers.

3.1. Transient connection disturbances

  • Transport keeps buffering outgoing messages in its TX queues;
  • When the connection works again, it transmits all buffered messages and empties the TX queues.

3.2. Long disconnection with full TX queue

  • Transport notifies the gossip protocol on the receiving side, which sends a retransmission request containing an artifact filter that tells the sender the latest adverts the receiver has seen;
    • The receiver may not need to catch up on all artifacts, since it may already have received the same adverts from other peers before sending the retransmission request.
  • On receiving a retransmission request, the sender sends all relevant adverts, according to the filter, through the TX queue.
  • If the TX queue becomes full again, another retransmission takes place (sketched below).
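
A toy Rust sketch of the sender side, assuming the artifact filter is a simple height watermark; the real filter is richer.

  struct RetransmissionRequest {
      /// "I have seen all adverts up to and including this height."
      last_seen_height: u64,
  }

  struct Advert {
      height: u64,
      hash: u64,
  }

  /// Sender side: answer a retransmission request by re-queuing every advert
  /// the filter marks as unseen.
  fn handle_retransmission(
      request: &RetransmissionRequest,
      pool_adverts: &[Advert],
      tx_queue: &mut Vec<u64>,
  ) {
      for advert in pool_adverts {
          if advert.height > request.last_seen_height {
              tx_queue.push(advert.hash); // goes out through the TX queue again
          }
      }
  }

  fn main() {
      let adverts = [Advert { height: 9, hash: 0xa }, Advert { height: 12, hash: 0xb }];
      let mut tx_queue = Vec::new();
      let request = RetransmissionRequest { last_seen_height: 10 };
      handle_retransmission(&request, &adverts, &mut tx_queue);
      assert_eq!(tx_queue, vec![0xb]); // only the unseen advert is resent
  }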

4. State sync protocol

  • Nodes periodically advertise all locally available state versions with adverts;
    • A version consists of the block height to which the state corresponds and the state hash.
  • If a node sees a more recent CUP, it can conclude that it has fallen behind and can request the state from the peers that advertise it;
    • The protocol ensures that unchanged pieces of the state are not re-downloaded: the state can be viewed as a file tree, and each file is split into chunks.
    • A node can simultaneously request chunks from multiple peers like BitTorrent.
  • The resuming node starts by requesting the manifest chunk, which lists all files and chunks of the state together with their hashes;
    • The manifest is peer-agnostic and the manifest hash is included in the CUP.
    • Once the manifest hash is verified, one can conclude all file and chunk hashes are authentic.
  • The node then requests missing chunks from multiple peers; a verification sketch follows.
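
A Rust sketch of chunk verification against the manifest; std’s DefaultHasher stands in for a cryptographic hash (a real implementation must use one), and all structures are simplified.

  use std::collections::hash_map::DefaultHasher;
  use std::hash::{Hash, Hasher};

  // Toy digest; NOT cryptographic, for illustration only.
  fn digest(bytes: &[u8]) -> u64 {
      let mut hasher = DefaultHasher::new();
      bytes.hash(&mut hasher);
      hasher.finish()
  }

  struct Manifest {
      /// (file path, chunk index, chunk hash) for every chunk of the state.
      chunks: Vec<(String, u32, u64)>,
  }

  /// A chunk fetched from any peer is accepted only if its hash matches the
  /// manifest entry; this is what makes the manifest peer-agnostic.
  fn verify_chunk(manifest: &Manifest, file: &str, index: u32, data: &[u8]) -> bool {
      manifest
          .chunks
          .iter()
          .any(|(f, i, h)| f == file && *i == index && *h == digest(data))
  }

  fn main() {
      let data = b"chunk contents";
      let manifest = Manifest {
          chunks: vec![("canister_states/0".into(), 0, digest(data))],
      };
      // The manifest itself is accepted only if its own hash matches the
      // manifest hash recorded in the CUP (omitted here).
      assert!(verify_chunk(&manifest, "canister_states/0", 0, data));
  }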

4.1. Monolithic P2P layer not suitable for state sync

  • The P2P layer is designed to distribute small messages, which is not what the state sync protocol needs.
  • To simplify the P2P layer, it is split in two: one instance for state sync and another for the remaining clients.
  • The P2P layer uses a new transport component that supports two async APIs, push(message, peer_id) and rpc(message, peer_id) -> response (sketched after this list).
    • P2P periodically calls push() with the current state versions to advertise its own state to all peers.
    • When a node notices it has fallen behind, it calls rpc() to request specific chunks.
  • With a single TCP stream it is impossible to relate requests to responses without tracking state; therefore QUIC is used to multiplex multiple streams over one connection.
    • The P2P layer can be completely asynchronous and better utilize CPU and bandwidth resources, e.g., congestion during state sync does not necessarily affect other adverts.
    • Every response is tied to its corresponding request without having to maintain state.
    • This also helps dynamically prioritize the traffic of different clients.
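
A Rust sketch of the two APIs as a trait (async fn in traits requires Rust 1.75+). The signatures follow the text; the error type and everything else are assumptions.

  type PeerId = u64;
  type Message = Vec<u8>;

  #[derive(Debug)]
  struct TransportError;

  trait Transport {
      /// Fire-and-forget, e.g., periodically advertising the locally
      /// available state versions to all peers.
      async fn push(&self, message: Message, peer_id: PeerId) -> Result<(), TransportError>;

      /// Request/response over its own QUIC stream, so the response is tied
      /// to the request without connection-level bookkeeping, and a slow
      /// state-sync transfer does not block unrelated traffic.
      async fn rpc(&self, message: Message, peer_id: PeerId) -> Result<Message, TransportError>;
  }

  fn main() {} // the trait above is a shape sketch; implementations need an async runtime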

5. Fully QUIC-based P2P layer

  • IC’s P2P layer stops using TCP altogether, which means a shift to a fully asynchronous implementation of the P2P layer.
    • Each request is sent as a new QUIC stream and handled independently from other requests.
    • Each client (e.g., consensus, state sync, key distribution) uses a separate instance of the P2P layer (state sync uses its dedicated one).
  • Uses a new abstract data structure, the slot table, to track the content of the validated artifact pool and the progress of updating peers.
  • Reduces the block rate under heavy load.
  • Will eventually shift all clients to use the new P2P layer.

5.1. Properties for consensus-related clients

  • Bounded number of active artifacts in the validated artifact pool; the consensus protocol uses checkpoints to purge artifacts periodically, hence there is a maximal pool size \(C\).
  • Explicit expiry of artifacts; if an artifact is purged from a pool, it is no longer disseminated to peers; if no peer of a node has an artifact, the node is guaranteed not to need that artifact even if it failed to receive it.
    • During state synchronization, nodes use artifacts to update their own state; once the state is updated, the artifacts can be safely purged after some time (they are kept briefly, e.g., to help other nodes synchronize).
    • Newly joined nodes use a CUP for offline state sync.

5.2. Slot table

  • Maintained on the sender side and inferred on the receiver side.
  • The table size corresponds to the number of active artifacts in the validated pool.
  • Whenever an artifact is added to the validated pool, it is added to the slot table on the sender side.
    • The sender sends out a slot update message to all peers.
    • Receivers infer the slot table state based on the arrived slot update messages.
    • Deletions are propagated implicitly by a new artifact reusing the slot.
    • This allows nodes to notice when an artifact no longer exists in any peer’s slot table, so it can be removed from the unvalidated pool.
  • Each slot maintains a version number for the slot’s artifact; receivers only accept update messages with a higher version number than the one they already have (a sketch follows this list).
  • A lightweight thread is spawned per slot per peer to reliably push slot update messages.
  • The approach combines buffering and dropping messages to handle backpressure, achieving both resilience and liveness.
  • Bounds on the unvalidated pool: \(C\) artifacts from an honest peer and \(2C\) from a malicious peer.
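
A Rust sketch of the receiver-side slot logic described above: slots are reused for new artifacts, and per-slot version numbers make stale updates harmless. All names are illustrative.

  use std::collections::HashMap;

  type SlotId = u32;
  type ArtifactId = u64;

  struct SlotUpdate {
      slot: SlotId,
      version: u64, // monotonically increasing per slot on the sender side
      artifact: ArtifactId,
  }

  #[derive(Default)]
  struct ReceiverView {
      /// Inferred copy of one peer's slot table: slot -> (version, artifact).
      slots: HashMap<SlotId, (u64, ArtifactId)>,
  }

  impl ReceiverView {
      /// Apply an update only if it is newer than what the slot already holds.
      /// A new artifact reusing a slot implicitly signals the deletion of the
      /// artifact that previously occupied it.
      fn apply(&mut self, update: SlotUpdate) -> bool {
          match self.slots.get(&update.slot) {
              Some((version, _)) if *version >= update.version => false, // stale
              _ => {
                  self.slots.insert(update.slot, (update.version, update.artifact));
                  true
              }
          }
      }
  }

  fn main() {
      let mut view = ReceiverView::default();
      assert!(view.apply(SlotUpdate { slot: 0, version: 1, artifact: 10 }));
      // Slot reuse: artifact 10 is implicitly purged when artifact 11 takes slot 0.
      assert!(view.apply(SlotUpdate { slot: 0, version: 2, artifact: 11 }));
      // A delayed, stale update is rejected by the version check.
      assert!(!view.apply(SlotUpdate { slot: 0, version: 1, artifact: 10 }));
  }
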
Tags: dfinity network p2p