Right now the sc-network directly calls the client, tx-pool, finality proof builder, and so on (full list in the config module).
However whenever one of these components takes a long time to respond, the entire networking stack freezes for everyone.
Not only does this happens right now due to some runtime performance issues, but it also makes us quite vulnerable to attacks.
Instead of performing straight calls, we should in my opinion spawn a few background tasks (or even a new background task per call when that is possible) and perform the calls there.
My personal opinion is that isolating the calls to the clients/tx-pool/... should be performed by sc-service, but from a pragmatic perspective let's do it in sc-network.
Instead of passing directly an Arc<dyn Client> to the sync state machine, we should instead pass some sort of Isolated<Arc<dyn Client>> where Isolated exposes an asynchronous API.
Yes and no. It depends how #3230 is implemented.
What I had in mind for #3230 is that when we call client.call(...) we should not put the thread to sleep while waiting for something to happen. For example, the example I gave in #3230 is that we shouldn't block the thread while waiting for a network message to be received.
But even if we do that, it is still possible for client.call(...) to take several seconds of CPU operations (because of a lot of calculations) during which the network will be frozen.
But we could also consider using separate tasks as part of #3230, I don't know.
We should start a channel on matrix and continue there.
We should start a channel on matrix and continue there.
Why? This is the right place to discuss the implementation.
The code that is relevant for this issue is here: https://github.com/paritytech/substrate/blob/ae36c62223a8191e5bbca14ca5df3bc7d0edb6ea/client/network/src/chain.rs#L45
Ideally all the traits mentioned in this file would have asynchronous methods.
Since this is far from being trivial, we should, as mentioned above, add to the networking code an IsolatedClient struct and an IsolatedFinalityProofProvider struct that wrap around implementations of these traits and do the work in a background task.
When it comes to making the networking code async-friendly, it shouldn't be very hard to adjust light_client_handler.rs and block_requests.rs. However protocol.rs would be very tough to change.
I suggest we keep the synchronousity of protocol.rs and only adjust the new protocols.
However whenever one of these components takes a long time to respond, the entire networking stack freezes for everyone.
How so? Isn't it run on a thread pool?
I don't see how this proposed change will solve anything. Simply moving a bottleneck to a different place won't help much. You'd just get an overflow in the task queue.
How so? Isn't it run on a thread pool?
Answering ping, parsing messages, and everything "close to the socket" is done in a thread pool. Answering requests when access to the client is needed is not.
Simply moving a bottleneck to a different place won't help much.
Right now if the client takes a long time we just accumulate network messages from nodes. After this change, we would instead process all messages quickly/immediately and instead accumulate messages that we need to answer.
This means that:
1- We could detect when the client is over-capacity and return some sort of "sorry we're busy" error to the remote.
2- We could continue processing the block announces, transactions, and so on that the nodes send us.
As a more general problem, we call the client from within Future::poll, which is supposed to return quickly. Blocking in Future::poll is normally a no-no.
Answering requests when access to the client is needed is not.
Why not? Client itself is thread safe. I'm pretty sure handling requests does not lock anything for the duration of the client query. At least it was not initially.
1- We could detect when the client is over-capacity and return some sort of "sorry we're busy" error to the remote.
Request handling needs to be fair. E.g. bootnode resources need to scale evenly for all connecting peers. If the node is overloaded the request needs to wait till timeout and not be immediately denied.
2- We could continue processing the block announces, transactions, and so on that the nodes send us.
See above. I don't see why it is not the case right now.
This sounds like a lot of additional complexity to work around the actual problem. Handling network requests should be fast. If your network handler takes seconds to execute it needs to be optimized first, and not shoved into the queue. And if such a queue was to be organized, i'd argue that it should be managed explicitly, instead of relying on tokio executor or creating another unbounded channel. As it will most certainly require some kind of per-peer throttling.
Most helpful comment
Why? This is the right place to discuss the implementation.