Height: 4140956, 4140963, 4140970, 4140977, 4140984, 4140991. Block time > 30 s
@i359 In my opinion, this indicates that node 7 is either broken or disconnected from p2p. The time limit for a node to do its job is 30 seconds (2 x 15); after that, the next node does the job, replacing node 7.
This is expected behavior for Byzantine recovery. The challenge is that we only know it's not operational after 30 seconds, so time is wasted. An alternative design includes two replicated primary nodes in each round, @vncoelho, so that protocol changes (change view) only happen when both are simultaneously broken.
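A quick sanity check of the heights reported above, as a minimal sketch. The 15-second block time and the "2 x 15" limit are taken from this comment; the rotation size of 7 is only inferred from the spacing of the heights, not confirmed anywhere in this thread, and the variable names are illustrative.

```python
# Heights reported in this issue as having a block time above 30 s.
slow_heights = [4140956, 4140963, 4140970, 4140977, 4140984, 4140991]

BLOCK_TIME_S = 15                  # block time mentioned in the discussion
VIEW_TIMEOUT_S = 2 * BLOCK_TIME_S  # the "2 x 15" limit quoted above

# Distance between consecutive slow blocks.
gaps = [b - a for a, b in zip(slow_heights, slow_heights[1:])]

# ASSUMPTION: a rotation over 7 nodes (inferred from the gaps only).
# If that holds, every slow height lands on the same slot of the rotation.
ASSUMED_NODES = 7
slots = {h % ASSUMED_NODES for h in slow_heights}

print("gaps between slow blocks:", gaps)
print("timeout before the next node takes over:", VIEW_TIMEOUT_S, "s")
print("all slow heights share one rotation slot:", len(slots) == 1)
```

Evenly spaced slow blocks are what one would expect if a single node keeps missing its turn while the others produce blocks on time.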
Agreed, something is wrong with this node and the view was changed.
@igormcoelho @shargon, thanks for the clarification!
But can we skip failing consensus nodes? I ask because, since dBFT 2.0, consensus nodes track all commits that have occurred.
We do skip it, but only after 30 seconds... that's the challenge. Skipping it permanently would mean that it has permanently failed, and in that case we would need to replace it with another node in order to guarantee the BFT safety level of M nodes.
Right now, I think the best (feasible) path is to expand to a round-robin backup strategy, with two nodes working simultaneously (in a different round-robin scheme), such that a permanently failing pair would never happen (as this would break the whole f-node BFT guarantee). We are also investigating other, radically different types of consensus, but for a direct and practical fix of this issue, a consistent backup strategy would suffice (and then NEO voting could help us actually control malfunctioning nodes).
As you said, with commits monitored on p2p, it's quite easy to identify good nodes and build statistics on them... after that, voting could help us enforce these "good quality" standards.
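To make the backup idea concrete, here is a rough sketch and not the actual neo implementation. It assumes a simplified rotation where the primary is `height % n` and a hypothetical backup is picked by shifting the rotation by a fixed `offset`; both rules and all names are assumptions made for illustration only.

```python
def needs_view_change(height, n, failed, offset=None):
    """Return True if no working speaker exists for this height.

    ASSUMPTION: primary = height % n (simplified rotation).
    ASSUMPTION: backup = (height + offset) % n (hypothetical second scheme).
    """
    primary = height % n
    if offset is None:
        # Single primary: one failed node is enough to force a view change.
        return primary in failed
    backup = (height + offset) % n
    # Two primaries: a view change is needed only if both are failed.
    return primary in failed and backup in failed


def count_view_changes(heights, n, failed, offset=None):
    return sum(needs_view_change(h, n, failed, offset) for h in heights)


heights = range(70)  # ten full rotations over 7 nodes
single = count_view_changes(heights, 7, failed={6})
paired = count_view_changes(heights, 7, failed={6}, offset=3)
both_down = count_view_changes(heights, 7, failed={6, 2}, offset=3)

print("single primary, one failed node:", single)
print("two primaries, one failed node:", paired)
print("two primaries, failed pair (6 and 2):", both_down)
```

With one failed node the paired scheme never waits for the timeout, while a failed pair that happens to be scheduled together still triggers a view change, which is why the pairing scheme would need to be chosen with care.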
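A minimal sketch of such statistics, assuming commits can be observed as `(height, node_index)` pairs. That record shape, the function name, and the sample data below are all made up for illustration; they are not taken from the neo codebase or from real chain data.

```python
from collections import Counter


def participation(commits, n_nodes, n_blocks):
    """Fraction of blocks for which each node's commit was observed."""
    # A set removes duplicate observations of the same commit.
    seen = Counter(node for _, node in set(commits))
    return {node: seen[node] / n_blocks for node in range(n_nodes)}


# Made-up sample: 3 blocks, 4 nodes.
sample_commits = [
    (1, 0), (1, 1), (1, 2),
    (2, 0), (2, 1), (2, 3),
    (3, 0), (3, 1), (3, 2),
]

rates = participation(sample_commits, n_nodes=4, n_blocks=3)
for node, rate in sorted(rates.items()):
    print(f"node {node}: {rate:.0%} of commits seen")
```

A node whose rate stays well below the others over a long window would be a candidate for closer inspection, or for voters to reconsider.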
Hi @i359,
I've been following block times since you opened this issue and I did not find any abnormality.
In the last 200 blocks, the longest block time I found was 18 seconds (and that is quite rare). Most of them are in the 15-16 second range.
Since I can't find the problem (I don't know if it really exists), I will close this issue.
If you see this problem happening again, please do not hesitate to create a new issue so we can investigate further.
Thanks.
Agreed, @igormcoelho and @shargon, let's plan this design improvement for the medium term, 2-5 months from now. It will be nice to implement; @edgedlt has also been pointing to good references.