Substrate: Avoid missing a slot on slow proposer

Created on 15 Oct 2019  ·  29Comments  ·  Source: paritytech/substrate

I want to upload a contract file to the node, when the contract file is about 100kB, it can be successfully uploaded; but when I upload the contract at 240kB,the node can't generate a new block,the prompt message is as follows:
2019-10-15 17:50:56 Discarding proposal for slot 261855508; block production took too long
I compiled the substrate code version is commit 1ae7a901d9b2a7f4fefc7b5f5846c2d13ef6d8c7
by command cargo build --release
and started the node by ./target/release/substrate --dev,
the test contract wasm file I am using is as follows: reproduce_wasm_file.zip

I7-optimisation ⏱

Most helpful comment

@jeremymj This affects how we execute the block (either Wasm or Native (if available)):
https://github.com/paritytech/substrate/blob/a52a23db2390b36c7177161104be2731b80901bd/core/cli/src/execution_strategy.rs#L25

Note that this is merely a workaround - if you have incompatible version of the client that can't run natively, it will fall back to wasm anyway.

Imho we need to address:

  1. Figure out why verification takes so long and if we can speed it up.
  2. Figure out if the weight is properly applied (maybe it should be non linear)?
  3. Fix block production missing slot without marking the extrinsic as invalid (DoS vector).
  4. Consider producing an "uncomplete" block if the deadline is close (either this or 3 needs to be implemented).

All 29 comments

Is this a transaction weight issue ?

If an extrinsic take too much time to be executed then it shouldn't be put into the block, the weight mecanism should prevent block from being too big, I think the weight for this extrinsic is wrong.

I'm not sure how to solve it though.

cc @kianenigma

sidenote:

maybe some code could be optimised as well so we can handle contract of this size, but also maybe this contract size is quite big and the issue should be solve by reducing it with features like dynamic linking (which are not there yet).

From the looks of it, the block did take too long and the weight system did not catch it basically. But the transaction is actually one of the dynamic ones that we currently don't really deal with in the weight system. This can also be fixed by making the authorship code multi-threaded.

I must try and first reproduce this.

cc @pepyakin @andresilva

Thank you for your follow-up on this issue! I am now trying to optimize the contract business logic, reduce the contract size, and look forward to your updates.

I believe we should mark the extrinsic as invalid if we timeout during it's execution and it's the first one in the block.

@kianenigma I don't see how we could make authorship multi-threaded, you mean to somehow verify correctness separately?

@jeremymj as a workaround you may try --execution-block-construction=Native

I believe we should mark the extrinsic as invalid if we timeout during it's execution and it's the first one in the block.

like a back up solution if weight doesn't catch the too long transaction ?

Exactly, it allows us to prevent a situation where underpriced (underweighted) extrinsic can simply stall the network as we see in this issue.

@kianenigma I don't see how we could make authorship multi-threaded, you mean to somehow verify correctness separately?

It looks like the granularity of the time check in the current code is one transaction, meaning that if one transaction takes too long, we still miss the slot. I roughly imagine a parent thread that just moves on at some point if the proposal work takes too long should fix this. The slot worker should not wait for the proposer to finish and if the proposer fails to return for any reason, it should not miss the slot and just submit a half-full or even an empty block.

It looks like we have this and maybe the only problem is changing remaining_duration? the second argument of future::select() is what I was talking about actually
https://github.com/paritytech/substrate/blob/426c26b8bddfcdbaf8d29f45b128e0864b57de1c/core/consensus/slots/src/lib.rs#L186

Ah, I see. Indeed it makes sense. I don't think it requires threads though. You can imagine the proposer returning a Stream of proposals (instead of Future), and when deadline is reached you just take last element returned by that stream.

as for just reproduction, adding a thread::sleep() to any module's dispatchable and running a dev chain always leads to the same error (--execution native ofc.). Additionally, I am trying to add it as a testcase as well.

Ah, I see. Indeed it makes sense. I don't think it requires threads though. You can imagine the proposal being a Stream of proposals (instead of Future), and when deadline is reached you just take last element returned by that stream.

Seems very interesting. I am quite novice with futures but can give it a try.

Ah, I see. Indeed it makes sense. I don't think it requires threads though. You can imagine the proposer returning a Stream of proposals (instead of Future), and when deadline is reached you just take last element returned by that stream.

I don't understand this. What would be the last element? What kind of elements would the Proposer return?

Currently it produces a Block, but it would have to be a Stream of (unfinalized) blocks. I guess it would require copying the entire overlay every time though. I admit I haven't really looked into the code yet to check how feasible it is, but my reasoning was that after every extrinsic we would emit a copy of the intermediate state.

I don't understand this. What would be the last element? What kind of elements would the Proposer return?

I suppose Proposer.propose could either:

  • Instead of a block return a stream of extrinsics and the caller is responsible for making a block out of them.
  • Return an _unfinished block_ inside the stream as new transactions are added. The slot worker can at some point call it a day and move on.

But, I have a question about this polling and futures: From my understanding and @tomusdrw's explanations it looks like the proposer's .poll() can block the check for the deadline. Won't a stream suffer from more or less the same situation due to the underlying runtime?

Well, we do have select already, so we can ignore the slow future/stream, that just blocked on .poll() and resolve the Delay part. It just requires memoizing the previous value of the stream in the top-level future afaiu.

Currently it produces a Block, but it would have to be a Stream of (unfinalized) blocks. I guess it would require copying the entire overlay every time though. I admit I haven't really looked into the code yet to check how feasible it is, but my reasoning was that after every extrinsic we would emit a copy of the intermediate state.

Ahh, that sounds nice, but we should probably always return a Block. It would require some changes to the runtime api/block builder to support this kind of stuff, but it should be doable.

But I'm also not sure if all of this is so easy doable with Futures. Futures don't expect that internally you sleep and calling into the runtime is the same as sleep. We would need to build the blocks in a separate thread. So, this will require quite some amount of work.

This would also allow us to remove this max_duration parameter from Proposer.

Futures don't expect that internally you sleep and calling into the runtime is the same as sleep.

I think you meant blocking. As poll should complete fast, or otherwise you block the runtime thread (so other futures can't be polled), and calling into the runtime is one fat blocking operation.

Yeah sleeping/blocking, not such a big difference :P
Nevertheless, currently the code works, because we check the timer internally, but the futures logic does not works with the current implementation.

@kianenigma the issue is still about something different, we just talk about avoid missing slots. Nevertheless there is probably a bug in contracts. Checking a contract should probably not take that much time.

I assume the behaviour of the contract code is actually rather normal? the code is big and just takes too long to process? But yeah maybe they should be two separate issue. You can revert the title if you prefer to have it so.

I would wait for @pepyakin to comment on this issue.

@tomusdrw The problem is solved after adding the startup parameters, but there are additional optional values for this parameter. Where can I find out more about them?

@jeremymj which startup parameters?

I think @jeremymj is referring to --execution-block-construction=Native which was recommended earlier.

Nonetheless, I would not close this issue. The core problem still holds: future::select cannot poll the timeout and bail out, if a single runtime call (aka like a sleep) is taking too long, and as a consequence we miss a slot.

@bkchr I use the way @kianenigma recommended to start the node:
./target/release/substrate -dev --execution-block-construction=Native

@jeremymj This affects how we execute the block (either Wasm or Native (if available)):
https://github.com/paritytech/substrate/blob/a52a23db2390b36c7177161104be2731b80901bd/core/cli/src/execution_strategy.rs#L25

Note that this is merely a workaround - if you have incompatible version of the client that can't run natively, it will fall back to wasm anyway.

Imho we need to address:

  1. Figure out why verification takes so long and if we can speed it up.
  2. Figure out if the weight is properly applied (maybe it should be non linear)?
  3. Fix block production missing slot without marking the extrinsic as invalid (DoS vector).
  4. Consider producing an "uncomplete" block if the deadline is close (either this or 3 needs to be implemented).

@tomusdrw I see, thank you!

Sorry for belated reply. I think the problem is prosaic: put_code has some arbitrary cost in terms of gas, specifically it is 1 and I think that the weight is not applied in this case (besides the default one).

OTOH, while a 240k contract sounds to be big, it is not enourmous and I think it would be desirable that at least one contract of such size could fit into an empty block processing. Hence I'd like to see if and how we can speed up the contract code verification process and how much of an improvement #3869 would give us. I filled #3885 for this.

For anyone wanting to tackle this from the perspective of non-blockinng future, I've just written a test for my own understanding here that simulates what goes wrong in the slot worker when the author decides to sleep.

https://github.com/paritytech/substrate/compare/kiz-author-blocking?expand=1

Any updates on this that has not been recorded here? does the author still fail to return in time if one single extrinsic takes too much time?

I think this should be fixed now. You can still miss a slot, but we will backoff if we can not build the block in time. This means the allowed block time increases with the number of missed slots.

Given that, I will close this issue ;)

Was this page helpful?
0 / 5 - 0 ratings