Large Message Handling in GossipSub: Potential Improvements

October 31, 2024

Umar Farooq

20 min read

Motivation

The challenge of large message transmissions in GossipSub leads to longer than expected network-wide message dissemination times (and relatively higher fluctuations). It is particularly relevant for applications that require on-time, network-wide dissemination of large messages, such as Ethereum and Waku [1,2].

This matter has been extensively discussed in the libp2p community [3, 4, 5, 6, 7, 8], and numerous improvements have been considered (or even incorporated) for the GossipSub protocol to enable efficient large-message propagation [3, 7, 9, 10].

Problem description

Sending a message to N peers involves approximately $\lceil \log_D(N) \rceil$ rounds, with approximately $(D-1)^{X-1} \times D$ transmissions in each round, where $X, D$ represent the round number and mesh size.

Transmitting to a higher number of peers (floodpublish) can theoretically reduce latency by increasing the transmissions in each round to $(D-1)^{X-1} \times (F+D)$ , where $F$ represents the number of peers included in floodpublish.

This arrangement works fine for relatively small/moderate message sizes. However, as message sizes increase, significant rises and fluctuations in network-wide message dissemination time are seen.

Interestingly, a higher $D$ or $F$ can also degrade performance in this situation.

Several aspects contribute to this behavior:

Ideally, a message transmission to a single peer concludes in $\tau_1 = \frac {L}{R}+P$ (ignoring any message processing time), where $L, R, P$ represent message size, data rate, and link latency. Therefore, the time required for sending a message on a 100Mbps link with 100ms latency jumps from $\tau_1^{10KB} = 100.8ms$ for a 10KB message to $\tau_1^{1MB} = 180ms$ for a 1MB message. For $D$ peers, the transmission time increases to $\tau_D^{1MB} = (80 \times D) + 100ms$ , triggering additional queuing delays (proportional to the transmission queue size) during each transmission round.
In practice, $\tau_1^{1MB}$ sometimes rises to several hundred milliseconds, further exaggerating the abovementioned queuing delays. This rise is because TCP congestion avoidance algorithms usually limit maximum in-flight bytes in a round trip time (RTT) based on the congestion window ( $C_{wnd}$ ) and maximum segment size (MSS) to approximately ${C_{wnd} \times MSS}$ , with $C_{wnd}$ rising with the data transfer for each flow. Consequently, sending the same message through a newly established (cold) connection takes longer. The message transfer time lowers as the $C_{wnd}$ grows. Therefore, performance-improvement practices such as floodpublish, frequent mesh adjustment, and lazy sending typically result in longer than expected message dissemination times for large messages (due to cold connections). It is also worth mentioning that some TCP variants reset their $C_{wnd}$ after different periods of inactivity.
Theoretically, the message transmission time to D peers $(\tau_D)$ remains the same even if the message is relayed sequentially to all peers or simultaneous transmissions are carried out, i.e., $\tau_D = \sum_{i=1}^{D} \tau_i$ However, sequential transmissions finish early for individual peers, allowing them to relay early. This may result in quicker network-wide message dissemination.
A realistic network comprises nodes with dissimilar capabilities (bandwidth, link latency, compute, etc.). As the message disseminates, it's not uncommon for some peers to receive it much earlier than others. Early gossip (IHAVE announcements) may bring in many IWANT requests to the early receivers (even from peers already receiving the same message), which adds to their workload.
A busy peer (with a sizeable outgoing message queue) will enqueue (or simultaneously transfer) newly scheduled outgoing messages. As a result, already scheduled messages are prioritized over those published by the peer itself, introducing a significant initial delay to the locally published messages. Enqueuing IWANT replies to the outgoing message queue can further exaggerate the problem. The lack of adaptiveness and standardization in outgoing message prioritization are key factors that can lead to noticeable inconsistency in message dissemination latency at each hop, even in similar network conditions.
Message size directly contributes to peers' workloads in terms of processing and transmission time. It also raises the probability of simultaneous redundant transmissions to the same peer, resulting in bandwidth wastage, congestion, and slow message propagation to the network. Moreover, the benefits of sequential message relaying can be compromised by prioritizing slow (or busy) peers.
Most use cases necessitate validating received messages before forwarding them to the next-hop peers. For a higher message transfer time $(\tau )$ , this store-and-forward delay accumulates across the hops traveled by the message.

Possible improvements

1. Minimizing transfer time for large messages

The impact of message size and achievable data rate on message transmit time $\tau$ is crucial as this time accumulates due to the store-and-forward delay introduced at intermediate hops.

Some possible improvements to minimize overall message dissemination latency include:

a. Message fragmentation

In a homogeneous network, network-wide message dissemination time (ignoring any processing delays) can be simplified to roughly $\delta \approx \delta_{Tx} + P_h$ , where $\delta_{Tx}$ represents accumulative message transmit time denoted as $\delta_{Tx} = \frac{S}{R} \times h$ , with $S, R$ being the data size and data rate, and $h, P_h$ being the number of hops in the longest path and message propagation time through the longest path.

Partitioning a large message into n fragments reduces a single fragment transmit time to $\frac{\delta_{Tx}}{n}$ . As a received fragment can be immediately relayed by the receiver (while the sender is still transmitting the remaining fragments), it reduces the transmit time to $\delta_{Tx} = \frac{S}{R} \times \frac{2h-1}{n}$ .

This time reduction is mainly attributed to the smaller store-and-forward delay involved in fragment transmissions.

However, it is worth noting that many applications require each fragment to be individually verifiable. At the same time, message fragmentation allows a malicious peer to never relay some fragments of a message, which can lead to a significant rise in the application's receive buffer size.

Therefore, message fragmentation requires a careful tradeoff analysis between time and risks.

b. Message staggering

Considering the same bandwidth, the time $\tau_D$ required for sending a message to D peers stays the same, even if we relay to all peers in parallel or send sequentially to the peers, i.e., $\tau_D = \sum_{i=1}^{D} \tau_i$ .

However, sequential relaying results in quicker message reception at individual peers ( $\tau_1 \approx \frac{\tau_D}{D}$ ) due to bandwidth concentration for a particular peer. So, the receiver can start relaying early to its mesh members while the original sender is still sending the message to other peers.

As a result, after every $\frac{\tau_D}{D}$ milliseconds, the number of peers receiving the message increases by $2^X\ \forall\ X \in \{0, D-1\}$ and by $\sum_{k=X-D}^{X-1} \lambda_k\ \forall\ X \geq D$ . Here, $X$ represents message transmission round $X = i \cdot \frac{\tau_D}{D} \mid i \in \mathbb{N}_0$ , and $\lambda_k$ represents the number of peers that received the message in round $k$ .

It is worth noting that a realistic network imposes certain constraints on staggering for peers. For instance, in a network with dissimilar peer capabilities, placing a slow peer (also in cases where many senders simultaneously select the same peer) at the head of the transmission queue may result in head-of-line blocking for the message queue.

At the same time, early receivers get many IWANT requests, increasing their workload.

c. Message prioritization for slow senders

A slow peer often struggles with a backlog of messages in the outgoing message queue(s) for mesh members. Any new message transmission at this stage (especially the locally published messages) gets delayed. Adaptive message-forwarding can help such peers prioritize traffic to minimize latency for essential message transfers.

For instance, any GossipSub peer will likely receive every message from multiple senders, leading to redundant transmissions [11]. Implementing efficient strategies (only for slow senders) like lazy sending and prioritizing locally published messages/IWANT replies over already queued messages can help minimize outgoing message queue sizes and optimize bandwidth for essential message transfers.

A peer can identify itself as a slow peer by using any bandwidth estimation approach or simply setting an outgoing message queue threshold for all mesh members.

Eliminating/deprioritizing some messages can lower a peer's score, but it also earns the peer an overall better score by achieving some early message transfers.
For instance, sending many near-first messages can only save a peer from a deficit penalty. On the other hand, sending only one message (assuming MeshMessageDeliveriesThreshold defaults to 1) as the first delivered message can add to the accumulative peer score.

2. Mitigating transport issues

Congestion avoidance algorithms used in various TCP versions directly influence achievable throughput and message transfer time as maximum unacknowledged in-flight bytes are based on the congestion window $(C_{wnd})$ size.

Rapid adaptation of $C_{wnd}$ to the available network conditions can help lower message dissemination latency.

Therefore, selecting a more suitable TCP variant like BBR, which is known for its ability to dynamically adjust the congestion window based on network conditions, can significantly enhance GossipSub's performance.

At the same time, parameters like receive window scaling and initial $C_{wnd}$ also impact message transfer time, but these are usually OS-specific system-wide choices.

One possible solution is to raise $C_{wnd}$ by exchanging data over the newly established connection. This data may involve useful details like peer exchange information and gossip to build initial trust, or GossipSub can use some dummy data to raise $C_{wnd}$ to a reasonable level.

It's important to understand that some TCP variants reset $C_{wnd}$ after specific periods of inactivity [12]. This can lead to a decline in TCP's performance for applications that generate traffic after intervals long enough to trigger the resetting of the congestion window.

Implementing straightforward measures like transport-level ping-pong messages can effectively mitigate this problem [13].

The limitations faced with $C_{wnd}$ scaling also impact some performance optimizations in GossipSub. For instance, floodpublishing is an optimization relying on additional transmissions by the publisher to minimize message dissemination latency.

However, a small $C_{wnd}$ value in (new/cold) TCP connections established with floodpublish peers significantly increases message transmission time [4]. Usually, these peers also receive the same message from other sources during this time, wasting the publisher's bandwidth.

The same is the case with IWANT replies.

Maintaining a bigger mesh (with warm TCP connections) and relaying to $D$ peers can be a better alternative to this problem.

3. Eliminating redundant transmissions

For every received packet, a peer makes roughly $D$ transmissions to contribute its fair share to the spread of messages. However, the fact that many recipients had already received the message (from some other peer) makes this message propagation inefficient.

Although the $D$ -spread is attributed to quicker dissemination and resilience against non-conforming peers, many potential solutions can still minimize redundant transmissions while preserving the resilience of GossipSub.

These solutions, ranging from probabilistic to more knowledgeful elimination of messages from the outgoing message queue, not only address the issue of redundancy but also provide an opportunity for bandwidth optimization, especially for resource-constrained peers.

For instance, an IDONTWANT message, a key component of GossipSub (v1.2) [10], can significantly reduce redundant transmissions.

It allows any node to notify its mesh members that it has already received a message, thereby preventing them from resending the same message. This functionality is useful when a node receives a message larger than a specified threshold.

In such cases, the node promptly informs its mesh peers about the successful reception of the message by sending IDONTWANT messages.

It's important to note that an IDONTWANT message is essentially an IHAVE message, but with a crucial difference, i.e., IHAVEs are only transmitted during the heartbeat intervals, whereas IDONTWANTs are sent immediately after receiving a large message.

This prompt notification helps curtail redundant large message transmissions without compromising the GossipSub resilience.

However, the use of IDONTWANT messages alone has an inherent limitation. For instance, a peer can only send an IDONTWANT after receiving the complete message.

A large message transmission consumes significant time. For example, transmitting a 1MB message at 100 Mbps bandwidth may consume 80 to several hundred milliseconds (depending upon $C_{wnd}$ and latency).

As a result, other mesh members may also start transmitting the same message during this interval. A few potential solutions include:

a. Staggering with IDONTWANT messages

As previously discussed, staggering can significantly reduce network-wide message dissemination latency. This is primarily due to the relatively smaller store-and-forward delays that are inherent in this approach.

Using both staggering and IDONTWANT messages can further enhance efficiency by reducing redundant transmissions. This is because a node only saturates its bandwidth for a small subset of mesh peers, leading to early transmissions and prompt IDONTWANT message notifications to the mesh members.

It is worth highlighting that staggering can be implemented in various ways.

For example, it can be applied to peers (peer staggering) where a node sequentially relays the same message to all peers one by one.

Alternatively, a node can send a different message to every peer (message staggering or rotational sending), allowing IDONTWANTs for other messages to arrive during this time. The message staggering approach is beneficial when several messages are introduced to the network within a short interval of time.

As the peers in staggered sending are sequentially covered (with a faster speed due to bandwidth concentration), this leads to another problem.

The early covered peers send IHAVE (during their heartbeat intervals) for the messages they have received. IHAVE announcements for newly received large messages trigger IWANTs from nodes (including those already receiving the same message), leading to an additional workload for early receivers [14].

Potential solutions to mitigate these problems include:

Defering IHAVE announcements for large messages.

Deferring IHAVE announcements can indirectly prioritize message transmission to the mesh peers over IWANT replies. However, deciding on a suitable deferred interval is crucial for optimal performance. One possible solution is to generate IHAVEs only after the message is relayed to all the mesh peers.

Defering IWANT requests for messages that are currently being received.

This requires prior knowledge of msgIDs for the messages under reception. Knowing the message length is also essential in deciding a suitable defer interval to handle situations where a sender starts sending a message and never completes the transmission.

Not issuing IWANT for a message if at least $K$ peers have transmitted IDONTWANT for the same message (as this indicates that these peers will eventually relay this message).

However, this approach can inadvertently empower a group of non-conforming mesh peers to send IDONTWANT for a message and never complete message transmission. A delayed IWANT, along with negative peer scoring, can remedy this problem.

b. IMReceiving message

A peer can issue an IDONTWANT only after it has received the entire message. However, a large message transmission may take several hundred milliseconds to complete. During this time, many other mesh members may start relaying the same message.

Therefore, the probability of simultaneously receiving the same message from multiple senders increases with the message size, significantly compromising the effectiveness of IDONTWANT messages.

Sending a short preamble (containing msgID and length) before the message transmission can provide valuable information about the message. If a receiver is already receiving the same message from another sender, the receiver can request to defer this transmission by sending a brief IMReceiving message.

An IDONTWANT from the receiver will indicate successful message reception. Otherwise, the waiting sender can initiate transmission after a specific wait interval.

However, waiting for IMReceiving after sending the preamble can delay the message transmission. On the other hand, proceeding with message transfer (after sending the preamble) leads to another problem: it is difficult to cancel ongoing message transmission after receiving IMReceiving for the same message.

To streamline this process, a peer can immediately send an IMReceiving message (for every received preamble), urging other mesh peers to defer sending the same message [15, 16].

The other peers can send this message if IDONTWANT is not received from the receiver during the wait interval. This approach can boost IDONTWANT benefits by considering ongoing transmissions for large messages.

While IMReceiving messages can bring about substantial improvements in terms of latency and bandwidth utilization, it's crucial to be aware of the potential risks.

A malicious user can exploit this approach to disrupt message transmission either by never completing a message or by intentionally sending a message at an extremely slow rate to numerous peers.

This could ultimately result in network-wide slow message propagation.

However, carefully calibrating the deferring interval (based on message size) and negative peer scoring can help mitigate these risks.

c. IDONTWANT message with reduced forwarding

It is common for slow peers to pile up outgoing message queues, especially for large message transfers. This results in a significant queuing delay for outgoing messages. Reduced message forwarding can help decrease the workload of slower peers.

On receiving a message longer than the specified threshold, a slow peer can relay it to only $K \in D$ peers and send an IDONTWANT message to all the peers in $D$ .

In this arrangement, the IDONTWANT message serves an additional purpose: to promptly announce data availability, reinforcing redundancy in the presence of adversaries.

When a peer receives an IDONTWANT for an unseen message, it learns about the new message and can request it by sending an IWANT request without waiting for the heartbeat (gossip) interval. As a result, a significantly smaller number of transmissions is sufficient for propagating the message to the entire network.

This approach conserves peer bandwidth by minimizing redundant transmissions while ensuring GossipSub resilience at the cost of one RTT (for missing peers).

Interestingly, curtailing queuing delays can also help lower network-wide message dissemination latency (for huge messages).

However, finding an appropriate value for $K$ is crucial for optimal performance. A smaller $K$ saves peer bandwidth, while a larger $K$ achieves quicker spread until outgoing message queues pile up. Setting $K = D_{low}$ can be one option.

It is worth mentioning that such behavior may negatively impact peer scoring (by missing message delivery rewards from $D-K$ peers). However, a minimized workload enables early message dissemination to the remaining peers. These early transmissions and randomized $K$ set selection can help achieve an overall better peer score.

4. Message prioritization

Despite the standardized specifications of the GossipSub protocol, the message forwarding mechanisms can significantly impact network-wide message dissemination latency and bandwidth utilization.

It is worth mentioning that every node is responsible for transmitting different types of packets, including control messages, locally published messages, messages received from mesh members, IWANT replies, etc.

As long as traffic volume is lower than the available data rate, the message forwarding mechanisms yield similar results due to negligible queuing delays.

However, when the traffic volume increases and exceeds the available peer bandwidth (even for short traffic bursts), the outgoing message queue(s) sizes rise, potentially impacting the network's performance.

In this scenario, FIFO-based traffic forwarding can lead to locally published messages being placed at the end of the outgoing message queue, introducing a queuing delay proportional to the queue size. The same applies to other delay-sensitive messages like IDONTWANT, PRUNE, etc.

On the other hand, the segregation of traffic into priority and non-priority queues can potentially starve low-priority messages. One possible solution is to use weighted queues for a fair spread of messages.

Message prioritization can be a powerful tool to ensure that important messages reach their intended recipients on time and allow for customizable message handling.

For example, staggering between peers and messages can be better managed by using priority queues. However, it is important to note that message prioritization also introduces additional complexity to the system, necessitating sophisticated algorithms for better message handling.

5. Maximizing benefits from IWANT messages

During heartbeat intervals, GossipSub nodes transmit IHAVE messages (carrying IDs of seen messages) to the peers not included in the full-message mesh. These peers can use IWANT messages to request any missing messages. A budget counter ensures these messages never exceed a specified threshold during each heartbeat interval.

The IHAVE/IWANT messages are a crucial tool in maintaining network connectivity. They bridge the information gap between nearby and far-off peers, ensuring that information can be disseminated to peers outside the mesh. This function is essential in protecting against network partitions and indirectly aids in safeguarding against Sybil and eclipse attacks.

However, it is essential to understand that high transmission times for large messages require careful due diligence when using IWANT messages for reasons not limited to:

A large message reception may take several hundred milliseconds to complete. During this time, an IHAVE message announcing the same message ID will trigger an IWANT request.
A peer can send IWANT requests for the same message to multiple nodes, leading to simultaneous transmissions of the same message.
Replying to (potentially many) IWANT requests can delay the transmission of the same message to mesh peers, resulting in lower peer scores and slower message propagation.

A few possible solutions to mitigate this problem may include:

Issuing IHAVE announcements only after the message is delivered to many mesh peers.
Allocating a volume-based budget to service IWANT requests during each heartbeat interval.
Deferring IWANT requests for messages that are currently being received.
Deferring IWANT requests if at least $K$ IDONTWANTs are received for the same message.
A large message transmission can yield high $C_{wnd}$ ; preferring such peers during mesh maintenance can be helpful.

Summary

This study investigates the pressing issue of considerable fluctuations and rises in network-wide dissemination times for large messages.

We delve into multiple factors, such as increased message transmit times, store-and-forward delays, congestion avoidance mechanisms, and prioritization between messages, to establish a comprehensive understanding of the problem.

The study also explores the performance of optimization efforts like floodpublishing, IHAVE/IWANT messages, and message forwarding strategies in the wake of large message transmissions.

A key finding is that most congestion avoidance algorithms lack optimization for peer-to-peer networks. Coupling this constraint with increased message transmission times results in notable store-and-forward delays accumulating at each hop.

Furthermore, the probabilistic message-forwarding nature of GossipSub further exacerbates the situation by utilizing a considerable share of available bandwidth on redundant transmissions.

Therefore, approaches focused on eliminating redundant transmissions (IDONTWANT, IMReceiving, lazy sending, etc.) can prove helpful. At the same time, strategies aimed at reducing store-and-forward delays (fragmentation, staggering, prioritization, etc.) can prove beneficial.

It is worth mentioning that many of the strategies suggested in this post are ideas at different stages. Some of these have already been explored and discussed to some extent [5, 17, 18]. We are nearing the completion of a comprehensive performance evaluation of these approaches and will soon share the results of our findings.

Please feel free to join the discussion and leave feedback regarding this post in the VAC forum.

References

[1] EIP-4844: Shard Blob Transactions. Retrieved from https://eips.ethereum.org/EIPS/eip-4844

[2] Message Propagation Times With Waku-RLN. Retrieved from https://docs.waku.org/research/research-and-studies/message-propagation/

[3] Lenient Flood Publishing. Retrieved from https://github.com/libp2p/rust-libp2p/pull/3666

[4] Disable Flood Publishing. Retrieved from https://github.com/sigp/lighthouse/pull/4383

[5] GossipSub for Big Messages. Retrieved from https://hackmd.io/X1DoBHtYTtuGqYg0qK4zJw

[6] GossipSub: Lazy Sending. Retrieved from https://github.com/status-im/nim-libp2p/issues/850

[7] GossipSub: Limit Flood Publishing. Retrieved from https://github.com/vacp2p/nim-libp2p/pull/911

[8] GossipSub: Lazy Prefix Detection. Retrieved from https://github.com/vacp2p/nim-libp2p/issues/859

[9] Potential Gossip Improvement List for EIP4844. Retrieved from https://hackmd.io/@gRwfloEASH6NWWS_KJxFGQ/B18wdnNDh

[10] GossipSub Specifications v1.2: IDONTWANT Message. Retrieved from https://github.com/libp2p/specs/blob/master/pubsub/gossipsub/gossipsub-v1.2.md?plain=1#L52

[11] Number of Duplicate Messages in Ethereum’s GossipSub Network. Retrieved from https://ethresear.ch/t/number-duplicate-messages-in-ethereums-gossipsub-network/19921

[12] TCP Congestion Control: Re-starting Idle Connections. Retrieved from https://datatracker.ietf.org/doc/html/rfc2581#section-4.1

[13] PING/PONG Control Messages. Retrieved from https://github.com/libp2p/specs/pull/558

[14] IHAVE/IWANT Message Impact. Retrieved from https://github.com/vacp2p/nim-libp2p/issues/1101

[15] Large Message Handling IDONTWANT + IMReceiving Messages. Retrieved from https://forum.vac.dev/t/large-message-handling-idontwant-imreceiving/281

[16] IDONTWANT Message Impact. Retrieved from https://forum.vac.dev/t/idontwant-message-impact/283

[17] IWANT Message Impact. Retrieved from https://forum.vac.dev/t/iwant-messages-may-have-negative-impact-on-message-dissemination-latency-for-large-messages/366

[18] IDONTWANT Message Performance. Retrieved from https://vac.dev/rlog/gsub-idontwant-perf-eval/

Motivation​

Problem description​

Possible improvements​

1. Minimizing transfer time for large messages​

a. Message fragmentation​

b. Message staggering​

c. Message prioritization for slow senders​

2. Mitigating transport issues​

3. Eliminating redundant transmissions​

a. Staggering with IDONTWANT messages​

b. IMReceiving message​

c. IDONTWANT message with reduced forwarding​

4. Message prioritization​

5. Maximizing benefits from IWANT messages​

Summary​

References​