Diem provides public access to the Diem Blockchain via a public FullNode network. This endpoint speaks DiemNet and exposes the state sync and mempool protocols to downstream peers. Partners (e.g., VASPs) use this same network as their primary means of interacting with the blockchain. Currently, the endpoint is provided as a best-effort service with no guarantees for either partners or unknown public entities.
The major outcome of this work will be an enhanced interface that minimizes operator overhead for managing this access and improves the likelihood that partners and the public get high-quality access to the network. There is also an inherent assumption that a Diem upstream will be deployed in an environment with a TCP connection rate limiter that ensures a single IP address cannot establish more than one connection per fixed period of time (e.g., a minute).
One of the problems we solved recently was Validator FullNode (VFN) failover, in which one VFN forwards traffic to another if the upstream Validator goes offline. This required constructing a secondary public network in the VFN configs, for two reasons:
Secondly, we had an issue where a participant who wanted to access the network using both the public good and a private (e.g., paid) service needed two network configurations to ensure they always connected to their private service. Even then, there was no guarantee that the private service would be used.
To remedy this, we have a series of projects, defined above and described here in more depth:
Described in more detail in #6859.
Since Diem provides public access, folks are free to produce their own implementations and interact with the service as they see fit. While the software currently has no known issues that let a peer make excessive network demands, it is certainly feasible to build something that could harm performance. To address that, Diem will leverage a token bucket filter (TBF) to ensure that each peer is allocated a fair and reasonable amount of inbound and outbound traffic. Specifically, each connection / peer will get its own inbound TBF and outbound TBF. Combined with the connection rate limiter discussed above, this will prevent peers behind a single IP from exceeding their allowed throughput.
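To make the per-peer TBF idea concrete, here is a minimal sketch of a token bucket in Rust. It is not the Diem implementation: the names `TokenBucket` and `PeerLimits`, the "one token equals one byte" convention, and passing elapsed time in explicitly are all assumptions made for illustration.

```rust
use std::time::Duration;

/// Hypothetical token bucket; one token permits one byte of traffic.
#[derive(Debug)]
pub struct TokenBucket {
    capacity: u64,       // maximum burst size, in tokens
    refill_per_sec: u64, // sustained rate, in tokens per second
    tokens: u64,         // tokens currently available
}

impl TokenBucket {
    /// Starts full, so a new peer may burst up to `capacity` immediately.
    pub fn new(capacity: u64, refill_per_sec: u64) -> Self {
        Self { capacity, refill_per_sec, tokens: capacity }
    }

    /// Credits tokens for `elapsed` time, never exceeding `capacity`.
    /// Fractions of a token are truncated; a real limiter would likely
    /// carry the remainder forward between calls.
    pub fn refill(&mut self, elapsed: Duration) {
        let millis = elapsed.as_millis() as u64;
        let earned = millis.saturating_mul(self.refill_per_sec) / 1000;
        self.tokens = self.tokens.saturating_add(earned).min(self.capacity);
    }

    /// Takes `n` tokens if available; otherwise takes nothing.
    pub fn try_consume(&mut self, n: u64) -> bool {
        if n <= self.tokens {
            self.tokens -= n;
            true
        } else {
            false
        }
    }

    /// Tokens that could be consumed right now.
    pub fn available(&self) -> u64 {
        self.tokens
    }
}

/// One bucket per direction for each connection / peer.
pub struct PeerLimits {
    pub inbound: TokenBucket,
    pub outbound: TokenBucket,
}
```

With a capacity of 100 and a refill rate of 50 tokens per second, a peer that spends 80 tokens has 20 left, regains 50 after one second, and is capped at 100 no matter how long it stays idle.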
On the inbound side, the TBF will be configured so that bytes are not read from the TCP connection unless enough tokens are available. This enforces TCP back pressure and prevents a peer from forcing the Diem software into processing excessive incoming messages.
On the outbound side, the TBF will be configured to apply pressure to the applications, which in practice will result in excessive responses to queries being dropped.
It is important that these limits are configurable and come with appropriate logs and metrics, in case a legitimate peer experiences performance degradation due to excessively strict configurations.
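The two directions apply the same budget differently, which the following sketch tries to show. The function names, the `Outbound` enum, and the choice to allow partial reads on the inbound side are assumptions for illustration; the token budget is modeled as a plain counter to keep the example short.

```rust
/// Inbound: decides how many bytes to read from the socket right now.
/// Returning 0 means "do not read". Unread data then stays in the kernel
/// buffer, the TCP receive window fills, and the sender is slowed down.
pub fn inbound_read_budget(tokens: &mut u64, pending: u64) -> u64 {
    let n = pending.min(*tokens);
    *tokens -= n;
    n
}

/// Result of trying to send one outbound message.
#[derive(Debug, PartialEq)]
pub enum Outbound {
    Sent,
    Dropped,
}

/// Outbound: a message is sent only if it fits in the remaining budget.
/// Otherwise it is dropped and the budget is left untouched.
pub fn outbound_send(tokens: &mut u64, msg_len: u64) -> Outbound {
    if msg_len <= *tokens {
        *tokens -= msg_len;
        Outbound::Sent
    } else {
        Outbound::Dropped
    }
}
```

The asymmetry is deliberate in this sketch: inbound traffic can simply wait in the TCP stack, while an outbound response that exceeds the budget has nowhere to wait without consuming local memory, so it is discarded.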
The public good is a limited good in that it cannot support an unbounded number of downstreams without failure. To ensure that those who can connect maintain their connection and that the upstream service does not crash due to excessive open connections, Diem enforces a connection limit. Once a node has hit this limit, additional connections will not be permitted without operator intervention.
As a second part of this work, once optional mutual authentication is available, authenticated peers can bypass this connection limit, since we expect the number of such internal connections to be on the order of tens.
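A rough sketch of the connection limit with the authenticated bypass might look like the following. `ConnectionLimiter`, its fields, and the use of string peer IDs are hypothetical; how Diem actually identifies trusted peers is not specified here.

```rust
use std::collections::HashSet;

/// Hypothetical admission control for inbound connections.
pub struct ConnectionLimiter {
    max_unauthenticated: usize,  // cap on public (unauthenticated) peers
    trusted: HashSet<String>,    // assumed set of mutually authenticated peer IDs
    unauthenticated_open: usize, // public connections currently open
}

impl ConnectionLimiter {
    pub fn new(max_unauthenticated: usize, trusted: HashSet<String>) -> Self {
        Self { max_unauthenticated, trusted, unauthenticated_open: 0 }
    }

    /// Returns true if the connection is admitted.
    pub fn try_admit(&mut self, peer_id: &str) -> bool {
        if self.trusted.contains(peer_id) {
            // Authenticated peers bypass the cap entirely.
            return true;
        }
        if self.unauthenticated_open < self.max_unauthenticated {
            self.unauthenticated_open += 1;
            true
        } else {
            false
        }
    }

    /// Frees a slot when a public peer disconnects.
    pub fn release(&mut self, peer_id: &str) {
        if !self.trusted.contains(peer_id) {
            self.unauthenticated_open = self.unauthenticated_open.saturating_sub(1);
        }
    }
}
```

In this sketch, trusted peers are not counted at all, which only makes sense under the stated expectation that there are few of them.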
Thanks for this proposal. I think connection throttling (starting with incoming/outgoing bandwidth) will certainly be useful to help prevent DoS attacks. A few questions/comments:
1) Any thoughts about other resource consumption type attacks, such as requests that are low in bandwidth but use excessive CPU, memory, etc.?
2) How do we ensure that authenticated peers get enough resource access? Is there a plan to subdivide resources between authenticated connections vs public connections? E.g. If I have 10 Gbps, can I devote 6 Gbps to authenticated connections (which is subdivided equally) and 4 Gbps to public connections?
3) Will public (non-authenticated) connections still use negotiated keys to encrypt data?
@aching, thanks for the questions:
- Any thoughts about other resource consumption type attacks, such as requests that are low in bandwidth but use excessive CPU, memory, etc.?
On this side, I think we'd need analytics, at least about the amount of resources used by a connection or application. One of the things we should look at is ways to identify bad actors in general. That way we'd be able to use either blocking (via IP rules) or rate limiting (via the token bucket) to mitigate that.
- How do we ensure that authenticated peers get enough resource access? Is there a plan to subdivide resources between authenticated connections vs public connections? E.g. If I have 10 Gbps, can I devote 6 Gbps to authenticated connections (which is subdivided equally) and 4 Gbps to public connections?
I have considered, for the future, allowing us to change rate limiting based on trusted / untrusted status, as well as being able to change limits dynamically. I think that also comes down to how we want QoS to work, not only for trusted vs. untrusted peers, but also to ensure that trusted peers get all of their needs met (e.g., mempool & state-sync are guaranteed to make some progress and not overtake each other in the rate limits).
Regarding 1) and 2), I tend to think a lot about cgroups as a reasonable starting point for a model (although we most likely don't want cgroups themselves, since we'd have to have a process per limiter). But their hierarchical nature seems like a possible way to think about modeling resource limiting.
Additionally, regarding 1), we can think about bounding RPC calls to avoid resource hogging. For instance, state sync could return at most X bytes per call, make at most Y I/O calls, etc.
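As a small worked example of the hierarchical idea, the 10 Gbps question above can be modeled as a two-level split: first between classes by weight, then equally among the peers in a class. Both function names are hypothetical, and nothing here reflects a committed design.

```rust
/// Splits `total_bps` between classes in proportion to integer weights.
/// Integer division may leave a few bits per second unallocated.
pub fn split_by_weight(total_bps: u64, weights: &[u64]) -> Vec<u64> {
    let sum: u128 = weights.iter().map(|w| *w as u128).sum();
    if sum == 0 {
        return vec![0; weights.len()];
    }
    weights
        .iter()
        .map(|w| (total_bps as u128 * *w as u128 / sum) as u64)
        .collect()
}

/// Equal share for each peer within one class; None when the class is empty.
pub fn per_peer_bps(class_bps: u64, peers: u64) -> Option<u64> {
    if peers == 0 {
        None
    } else {
        Some(class_bps / peers)
    }
}
```

With weights 6 and 4 on a 10 Gbps link, the authenticated class gets 6 Gbps and the public class gets 4 Gbps; twelve authenticated peers would then each get 500 Mbps.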
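One way to picture the per-call byte bound is a helper that returns only as many whole items as fit under the cap. `bounded_response` and its parameters are hypothetical and are not part of the state sync API.

```rust
/// Returns the longest prefix of `items` whose total size fits in `max_bytes`.
/// The caller is expected to request the remainder in a follow-up call.
pub fn bounded_response(items: &[Vec<u8>], max_bytes: usize) -> &[Vec<u8>] {
    let mut used = 0usize;
    let mut count = 0usize;
    for item in items {
        // Stop before the first item that would push the response over the cap.
        if used + item.len() > max_bytes {
            break;
        }
        used += item.len();
        count += 1;
    }
    &items[..count]
}
```

Note that a single item larger than `max_bytes` yields an empty response in this sketch, so a real bound would need a rule for oversized items to guarantee progress.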