Go-ethereum: PoA network, all the sealers are waiting for each other after 2 months running, possible deadlock?

Created on 7 Jan 2019 · 73Comments · Source: ethereum/go-ethereum

System information

My current version is:

Geth
Version: 1.8.17-stable
Git Commit: 8bbe72075e4e16442c4e28d999edee12e294329e
Architecture: amd64
Protocol Versions: [63 62]
Network Id: 1
Go Version: go1.10.1
Operating System: linux
GOPATH=
GOROOT=/usr/lib/go-1.10

Expected behaviour

Keep the normal signing .

Actual behaviour

I was running a go-ethereum private network with 6 sealers.

Each sealer is run by:

directory=/home/poa
command=/bin/bash -c 'geth --datadir sealer4/  --syncmode 'full' --port 30393 --rpc --rpcaddr 'localhost' --rpcport 8600 --rpcapi='net,web3,eth' --networkid 30 --gasprice '1' -unlock 'someaddress' --password sealer4/password.txt --mine '

The blockchain was running good for about 1-2 months.

Today i found that all the nodes were having issues. Each node was emmiting the message "Signed recently, must wait for others"

I check out the logs and i found this message every 1 hour, no more information, the nodes where not mining:

Regenerated local transaction journal transactions=0 accounts=0
Regenerated local transaction journal transactions=0 accounts=0
Regenerated local transaction journal transactions=0 accounts=0
Regenerated local transaction journal transactions=0 accounts=0

Experimenting the same issue with 6 sealers, i restarted each node but now im get stucked in

INFO [01-07|18:17:30.645] Etherbase automatically configured address=0x5Bc69DC4dba04b6955aC94BbdF129C3ce2d20D34
INFO [01-07|18:17:30.645] Commit new mining work number=488677 sealhash=a506ec…8cb403 uncles=0 txs=0 gas=0 fees=0 elapsed=133.76µs
INFO [01-07|18:17:30.645] Signed recently, must wait for others

The first thing that is weird is that, some nodes are stucked on the 488677 and others are on 488676, this behaviour was reported on this issue https://github.com/ethereum/go-ethereum/issues/16406 same for the user @lyhbarry

Example:
Signer 1

Signer 2

Note that there is no votes pending

So, right now, i shut down and restar each node, i have found that:

Each node is paired with the others
Each node is part of clique.getSigners()
Each node is waiting for another to sign...

INFO [01-07|18:41:56.134] Signed recently, must wait for others 
INFO [01-07|19:41:42.125] Regenerated local transaction journal    transactions=0 accounts=0
INFO [01-07|18:41:56.134] Signed recently, must wait for others

So, the syncronization fail but also i just can start signing again because each node is stucked waiting for the others, that means, the network is useless?

The comment of @tudyzhb on that issue mention that:

Ref clique-seal of v1.8.11, I think there is no an effective mechanism to retry seal, when an in-turn/out-of-turn seal fail occur. So our dev network useless easily.

After this problem, i take a look at the logs, each signer has this error messages:

Synchronisation failed, dropping peer peer=7875a002affc775b err="retrieved hash chain is invalid"

INFO [01-02|16:42:10.902] Signed recently, must wait for others 
WARN [01-02|16:42:11.960] Synchronisation failed, dropping peer    peer=7875a002affc775b err="retrieved hash chain is invalid"
INFO [01-02|16:42:12.128] Imported new chain segment               blocks=1  txs=0 mgas=0.000 elapsed=540.282µs mgasps=0.000  number=488116 hash=269920…afd3c7 cache=5.99kB
INFO [01-02|16:42:12.129] Commit new mining work                   number=488117 sealhash=f7b00c…787d5c uncles=2 txs=0 gas=0     fees=0          elapsed=307.314µs
INFO [01-02|16:42:20.929] Successfully sealed new block            number=488117 sealhash=f7b00c…787d5c hash=f17438…93ffe3 elapsed=8.800s
INFO [01-02|16:42:20.929] 🔨 mined potential block                  number=488117 hash=f17438…93ffe3
INFO [01-02|16:42:20.930] Commit new mining work                   number=488118 sealhash=b09b33…1526ba uncles=2 txs=0 gas=0     fees=0          elapsed=520.754µs
INFO [01-02|16:42:20.930] Signed recently, must wait for others 
INFO [01-02|16:42:31.679] Imported new chain segment               blocks=1  txs=0 mgas=0.000 elapsed=2.253ms   mgasps=0.000  number=488118 hash=763a32…a579f5 cache=5.99kB
INFO [01-02|16:42:31.680] 🔗 block reached canonical chain          number=488111 hash=3d44dc…df0be5
INFO [01-02|16:42:31.680] Commit new mining work                   number=488119 sealhash=c8a5e7…db78a1 uncles=2 txs=0 gas=0     fees=0          elapsed=214.155µs
INFO [01-02|16:42:31.680] Signed recently, must wait for others 
INFO [01-02|16:42:40.901] Imported new chain segment               blocks=1  txs=0 mgas=0.000 elapsed=808.903µs mgasps=0.000  number=488119 hash=accc3f…44bc4c cache=5.99kB
INFO [01-02|16:42:40.901] Commit new mining work                   number=488120 sealhash=f73978…c03fa7 uncles=2 txs=0 gas=0     fees=0          elapsed=275.72µs
INFO [01-02|16:42:40.901] Signed recently, must wait for others 
WARN [01-02|16:42:41.961] Synchronisation failed, dropping peer    peer=7875a002affc775b err="retrieved hash chain is invalid"

I also see some:

INFO [01-02|16:58:10.902] 😱 block lost number=488205 hash=1fb1c5…a41a42
This error about hash chain was just a warning, so the node keep mining until the 2th of january, then i saw this on each of the 6 nodes

I was looking that there are a lot of issues about this error, the most similar is the one i posted here but is unresolved.

Most of the issues workarrounds seems to be a restart, but in this case, the chain seems to be is in a unconsistent state and the nodes are always waiting for the others

So,

any ideas? peers are connected, accounts are unlocked, it just entered into a deadlock situation after 450k blocks
any logs that i can provide? i only see the warnings of the error described and the block lost, but nothing when the node stoped to be mining
Is this PR related? https://github.com/ethereum/go-ethereum/pull/18072
Maybe is related with the comment of @karalabe onthis issue https://github.com/ethereum/go-ethereum/issues/16406?
5 Upgrading from 1.8.17 to 1.8.20 will solve this?
In my opinion, seems like a race condition or something, since i have 2 chains, one running for 2 months, the other one for three months and is the first time this error happens

This are other related issues:

https://github.com/ethereum/go-ethereum/issues/16444 (Same issue but i dont have votes pending in my snapshot)

https://github.com/ethereum/go-ethereum/issues/14381#

https://github.com/ethereum/go-ethereum/issues/16825

https://github.com/ethereum/go-ethereum/issues/16406

core help wanted investigation research

Source

marcosmartinez7

👍2

Most helpful comment

Reviewed in team call: @5chdn suggestions are good. We could solve this by making out-of-turn difficulty more complicated. There should be some deterministic order to out-of-turn blocks. @karalabe fears that this will introduce too much protocol complexity or large reorgs.

fjl on 5 Mar 2019

🎉5

All 73 comments

Based on this image

That is the situation of all the sealers, they just stop sealing waiting for each other, seems like a deadlock situation

Wich files can i check for errors since the js console isnt throwing anything?

marcosmartinez7 on 8 Jan 2019

This is the debug.stacks() info, i dont know it is important here but this is executing while the sealers are stucked:

marcosmartinez7 on 8 Jan 2019

Found that i have a lot of block lost on each node..

Can be this the problem? the chain was running with that warnings without any issues anyway..

Btw, it is caused by bad connection between nodes? Im using Digital Ocean droplets

NOTE: if i check eth.getBlockNuber i get 488676 or 488675 depending on the sealer

marcosmartinez7 on 8 Jan 2019

We experienced a similar deadlock on our fork, and the cause was due to out-of-turn blocks all being difficulty 1, mixed with a bit of bad luck. When the difficulties are the same, a random block is chosen as canonical, which can produce split decisions. You can compare hashes and recent signers on each node to confirm if your network is deadlocked in the same way. We had to both modify the protocol to produce distinct difficulties for each signer, and modify the same difficulty tie-breaker logic to make more deterministic choices.

jmank88 on 8 Jan 2019

👍3

Thanks for the response, can you give me an idea of how "compare hashes and recent signers on each node to confirm if your network is deadlocked in the same way." ?

Thanks

marcosmartinez7 on 8 Jan 2019

By getting the last 2 blocks from each node, you should be able to see exactly why they are stuck based on their view of the world. They all think that they have signed too recently, so they must disagree on what the last few blocks are supposed to be, so you'll see different hashes and signers for blocks with the same number (and difficulty!).

jmank88 on 8 Jan 2019

🎉1

Good idea!!

Sealer 1

Last block 488676

Last -1 = 488675

Sealer 2

Last is 488675

The second node didnt reach the 488676 block.

The hashes of block 488675 are different, but the difficulty are differents (1 and 2)

For other blocks, like block 8, the hashes are equals and the difficulty is 2 for both..

Seems like all the blocks has 2 of difficulty except that conflictive one..did you find any logical explanation to that?

Btw, dont know why difficulty = 2 since the genesis file uses 0x1

Thoughts?

marcosmartinez7 on 8 Jan 2019

The in-turn signer always signs with difficulty 2. Out-of-turn signers sign with difficulty 1. This is built-in to the clique protocol, and the primary cause of this problem in the first place. It looks like you have 6 signers. You will have to check them all to make sense of this.

jmank88 on 8 Jan 2019

So, if i found two signers (into my 6) with the same difficulty and different hash the deadlock would make sense right?

Same block, different difficulty and different hash doesnt probe anything?

I have deleted the chaindata of the other node with the same last block 488675

fail

marcosmartinez7 on 8 Jan 2019

Not necessarily. Those kind of ambiguous splits happen very frequently with clique and would normally sort themselves out.

Are you still trying to recover this chain?

jmank88 on 8 Jan 2019

If it is not necesarrily and it normally sort themselves out, then the deadlock theory maybe isnt valid..

What did you mean about "It looks like you have 6 signers. You will have to check them all to make sense of this."?

About the chain: I wanted to know what happened basically, i dont know if i can provide any kind of logs or something since the sealers just stoped to wait each other and i dont have any other information.

Also, getting this scenario in a production environment sucks, since i cant continue mining..and there is nothing on go-ethereum that guarantees that this will not happen again

So, just to make the things more clear, if the block 488675 has different difficulty and different hash doesnt probe that there was an issue? It is normal to have different hashes comparing in-turn with out-turn then?

marcosmartinez7 on 8 Jan 2019

Resyncing the signers that you deleted may produce a different distributed state which doesn't deadlock. Or it could deadlock again right away (or at any point in the future). Making fundamental protocol changes to clique like we did for GoChain is necessary to avoid the possibility completely, but can't be applied to an existing chain (without coding in a custom hard fork). You could start a new chain with GoChain instead.

jmank88 on 8 Jan 2019

What did you mean about "It looks like you have 6 signers. You will have to check them all to make sense of this."?

They all have different views of the chain. You can't be sure why each one was stuck without looking at them all individually.

jmank88 on 8 Jan 2019

Ok, but, what i am looking for?

Right now im deleting chain data for all the nodes except 1 and resync the rest of them (5 singers) from that node.

About this comment:

"By getting the last 2 blocks from each node, you should be able to see exactly why they are stuck based on their view of the world. They all think that they have signed too recently, so they must disagree on what the last few blocks are supposed to be, so you'll see different hashes and signers for blocks with the same number (and difficulty!)."

If i see two in turn or two out turn with the same difficulty and different hash that will confirm that they think that they have signed recently?

marcosmartinez7 on 8 Jan 2019

If i see two in turn or two out turn with the same difficulty and different hash that will confirm that they think that they have signed recently?

If they logged that they signed too recently then you can trust that they did. Inspecting the recent blocks would just give you a more complete picture of what exactly happened.

jmank88 on 8 Jan 2019

Well, i delete all the chain data for the 5 sealers and sync from 1

Started to work again but there is a sealer that seems to have connectivity issues or something..

The sealer starts with 6 peers, then goes to 4, 3, 2 then again to 4, 6 , etc...

And thats why i suppose the blocks are being lost... and probably thats why the sycnronization fail warning is throwed since is always the same node

Any ideas of why is this happening?

Connectivity issues since they are separate droplets?

Any way to troubleshoot this?

Thanks

marcosmartinez7 on 8 Jan 2019

I don't think the peer count is related to lost blocks, and neither peers or lost blocks are related to the logical deadlock caused by the same-difficulty ambiguity.

Regardless, you can use static/trusted enodes to add the peers automatically.

jmank88 on 8 Jan 2019

I add the nodes mannually, but it is weird that a sealer is always getting connectivity issues with the rest of the peers

I will try the static/trusted nodes.

I will put the block lost in a separate issue, but i would like to have a response from the geth team about the initial problem, because it seems like i can go into another deadlock issue again

Thanks @jmank88

PS: Do you think that the block sealing time can be an issue here? Im using 10 secs

marcosmartinez7 on 8 Jan 2019

'Lost blocks' are just blocks that were signed but didn't make the canonical chain. These happen constantly in clique, because most (~1/2) of the signers are eligible to sign at any given time, but only one block is chosen (usually the in-turn signer, with difficulty 2) - all of the other out-of-turn candidates become 'lost blocks'.

jmank88 on 8 Jan 2019

PS: Do you think that the block sealing time can be an issue here? Im using 10 secs

Faster times might increase the chances of bad luck or produce more opportunities for it to go wrong, but the fundamental problem always exists.

jmank88 on 8 Jan 2019

Right, i understand, so, nothing to worry into a PoA network then?

About time, yeah, completely agree

Thanks a lot!

marcosmartinez7 on 8 Jan 2019

Right, i understand, so, nothing to worry into a PoA network then?

I'm not sure what you mean. IMHO the ambiguous difficulty issues are absolutely fatal flaws - the one affecting the protocol itself is much more severe, but the client changes I linked addressed deadlocks as well.

jmank88 on 8 Jan 2019

It's also worth noting that increasing the number of signers may reduce the chance of deadlock, possibly having an odd number rather than even as well.

jmank88 on 8 Jan 2019

Yes sure, i mean, i didnt know about that, but is really good information and i really apreciate it. I was talking about the lost block warning, your explanation make sense for PoA

About # of signers, yes, i have read about that, makes sense. I have also implemented a PoC with just 2 sealers and, maybe im lucky, but in 700k blocks i did not experimented this issue.

Right now im using a odd number

marcosmartinez7 on 8 Jan 2019

👍1

Limiting to just 2 signers is a special case with no ambiguous same-difficulty blocks.

jmank88 on 8 Jan 2019

👍1

After removing 1 node and resync from the data of 1 of the nodes, i was running the network with 5 sealers without issues.

Summary:

Last block not sealed
Last block-1 sealed by 2 nodes

After 1 day it got stucked again, but now in weirdest situation:

The 5 nodes are stucked at the same block
If y query the block on each node, i have 3 different hashes for the same block
The last block is allways signed with difficulty = 1 , so, there is no sealing with difficulty = 2 so no sealing at all?
The last block -1 was sealed twice... two nodes with difficulty 2

Last block 503076

Sealer 1

Sealer 2

Sealer 4 (off turn with different hash and parent hash)

Sealer 5 (off turn, 3 side chain)

Sealer 6 (Same hash as sidechain but different parent)

The number of signers is 5

Each node is paired with 4 signers and 1 standard node

Last block -1: 503075

Sealer 1 (out off turn)

Sealer 2 (out off turn, same hash)

Sealer 4 (out off turn, different hash, same parent..)

Sealer 5 (in turn)

Sealer 6 (int turn too)

marcosmartinez7 on 9 Jan 2019

You can remove the stack traces, they are not necessary. This looks like a logical deadlock again. Can you double check your screenshots?

jmank88 on 9 Jan 2019

The lastblock -2 has some difference too, 2 nodes has different views of that node

marcosmartinez7 on 9 Jan 2019

Indeed, this looks like a multi-block fork, which has now stalled out with all branches having the same total difficulty.

jmank88 on 9 Jan 2019

The lastblock-3 is where they agree

S1.

S2.

S3.

S4.

S5.

S6.

marcosmartinez7 on 9 Jan 2019

If they all signed their own versions, then the hashes would be different. This indicates the last point where they all agreed on the same block.

jmank88 on 9 Jan 2019

Thats true, i edited my comment.

Could please anybody of go-ethereum give me a hint of what is happening here?

I have double checked that the lastblock has difficulty 1 on each node

Also, this is contradictory, i have checked the logs and i see that sealer 6 have sealed the last block, also the Sealer 2, but the difficulty is 1 on each sealer when queried!

Also, is weird that into another sealer i have 2 consecutive block sealings

marcosmartinez7 on 9 Jan 2019

They are always speculatively sealing on whatever branch is the best that they have seen, so those logs do not look unusual.

jmank88 on 10 Jan 2019

I understand, but, the last block with difficulty 1 on each node isnt usual right? i mean, they made speculative sealing and the result chain include a last block that wasnt sealed?

Do you relate this too with a deadlock? seems more like the multiple chains were corrupted since that deadlock

marcosmartinez7 on 10 Jan 2019

Can you elaborate? I'm not sure I understand.
The reason for there being only difficulty 1 blocks at the head would be that the in-turn signer had signed too recently (out-of-turn) to sign again (according to whichever branch it was following locally).

jmank88 on 10 Jan 2019

So, basically, if there are multiple forks (wrong behaviour) this could happen, but is not the expected situation (leads into a deadlock)

marcosmartinez7 on 10 Jan 2019

It is certainly not the desired behavior, but it is not wrong as defined by the clique protocol. Plus the client is arguably too strict about the edge case of peers with same total difficulty branches, which may just be due to being written originally for ethash.

jmank88 on 10 Jan 2019

👍1

Well, restarted again, running for about 10 hours and get deadlocked again,

is there any information that i can provide for this bug?

marcosmartinez7 on 10 Jan 2019

@marcosmartinez7; Hi, This seems strange. I am not really sure about it, but do you mind if I ask; Are you sure you are using different accounts for each miner (--unlock address)?

usmananwar on 14 Jan 2019

👍1

Yes, off course.

marcosmartinez7 on 14 Jan 2019

We are experiencing the same problem in our testnet and our production network. The difficulty chosen difficulties of 1 or 2 is the cause of this.

In Blockchain Federal Argentina (bfa.ar), we are sealing a new block every 5 seconds, and have seen this problem since we had around 8 sealers (now we are at around 14, I think).

I talked a bit with @marcosmartinez7 on Discord today, and it seems that one interesting solution could be to use prime numbers for difficulties, where if you are in-turn you have the highest possible prime number.

This is a protocol problem, as parts of the network does indeed get stuck in separate branches, just like @marcosmartinez7 experienced.

With monitoring you can detect it and do debug.rewind. Detecting it, doesn't stop it from happening, though.

rlegene on 17 Jan 2019

👍3

I talked a bit with @marcosmartinez7 on Discord today, and it seems that one interesting solution could be to use prime numbers for difficulties, where if you are in-turn you have the highest possible prime number.

I linked some of our fixes here: https://github.com/ethereum/go-ethereum/issues/18402#issuecomment-452141245, one of which was a protocol change to use dynamic difficulties from 1-n (for n signers) based on how recently each signer has signed. We've been running this on our mainnet since last May (5s blocks, 5-20 signers). Using primes is an interesting approach, but I'm not sure it's necessary (and could cause trouble, especially with a high number of signers). One neat feature of using 1-n is that all eligible signers will always sign with a difficulty > n/2, therefore any two consecutive out-of-turn blocks will always have a total difficulty > n and thus greater than a single in-turn block, so there won't be any late re-orgs from lagging signers producing lower numbered but higher total difficulty blocks (this is where I think primes would get you in to trouble).

jmank88 on 18 Jan 2019

👍2

This sounds pretty much like the deadlocks we experience on the Görli testnet. We were able to break it down to two issues that would greatly improve this situation:

the out-of-turn block sealing delay should be much, much higher. currently, it's sometimes lower than the network latency, causing authority nodes with high geographical distance constantly producing out-of-turn blocks. I suggest to put in at least a 5000 ms minimum delay before sealing out of turn blocks (plus random delay up to another 10000 ms). This is something that can be done without breaking the clique spec and will in most cases ensure that in-turn blocks are always faster propagated in the network than out-of-turn blocks.
the choice of 1 and 2 difficulty score for out-of-turn blocks and in-turn blocks is not ideal. two out-of-turn blocks have the same difficulty as one in-turn block. I believe, in-turn blocks must be much, much heavier, I would recommend an in-turn difficulty score of 3 to make sure, they always get priority and to avoid deadlock situation where you have two different chain tips with the same difficulty. Unfortunately, this would require a new spec / hardfork.

5chdn on 23 Jan 2019

👍2

fjl on 5 Mar 2019

🎉5

My suggestion:

If there are N miners,
distance is the number of blocks since miner X mined a block
Difficulty is calculated as min( distance, N), if X seals a block

Example, 10 signers:

Most blocks will have difficulty 10
If one signer drops off, the remaining will have difficulty 9 if they sign in-turn:ish, but lower if they don't.

holiman on 5 Mar 2019

To exit the deadlock you can set the chain back to one canonical block using: debug.setHead(hex_value)

yananli89 on 12 Mar 2019

@5chdn @fjl @karalabe PTAL on PR #19239

hadv on 13 Mar 2019

🎉1

To exit the deadlock you can set the chain back to one canonical block using: debug.setHead(hex_value)

It might work but we cannot always give an eye to the nodes and resolve the deadlock by running a command or restart the nodes. it's nightmare 😄

hadv on 13 Mar 2019

True! ^^

yananli89 on 13 Mar 2019

To exit the deadlock you can set the chain back to one canonical block using: debug.setHead(hex_value)

It might work but we cannot always give an eye to the nodes and resolve the deadlock by running a command or restart the nodes. it's nightmare 😄

Also the process to detect where the chain has forked and select the correct block to reset is not trivial

marcosmartinez7 on 18 Mar 2019

any news on this issue?

fab-spring on 12 Jul 2019

any news on this issue?

There is a PR (https://github.com/ethereum/go-ethereum/pull/19239) but still needs to be reviewed

marcosmartinez7 on 29 Jul 2019

Mine had the same issue recently. 5 sealers config.

One of them I don't know why but their block history got all messed up. The latest block was ~5100000 and this sealer was at ~3000000.

Two sealers stuck at 5125155 with matching block hashes. Other two sealers stuck at 5125155 with different matching block hashes.

I picked which two sealers I wanted the new head to be at. And for the other two stuck ones:

I dumped the blocks to a file using geth export blocks.dump 0 5125100. This made the sealers resync the last 55 blocks to the latest with the other sealers.
Deleted the chaindata dirs
geth init ...
Imported the blocks with geth import blocks.dump.

And the other weird sealer that was way off I just resynced from scratch. Everything is working good again. Just wondering why this happened in the first place.

dwalintukan on 11 Sep 2019

We're experiencing this issue too. 5s block times and 4 sealers. it ran about 5,5 months withouth any problems, now we got the same deadlock situation like described here.

Is there a plan to have a fix for this in one of the upcoming geth releases? https://github.com/ethereum/go-ethereum/pull/19239 seems to be providing a fix, but it looks like it stalled.

Did we have such Deadlocks in Rinkeby too?

ivica7 on 10 Oct 2019

I was able to resolve the issue for us. First, I noticed that I have a fork with 2 sealers being on one side and the other 2 on the other side.

What I did: I picked one sealer node and removed the peer who agrees on the same fork. Then I used debug.setHead to rewind the chain for ~50 blocks. After that everything worked fine.

So the question for me is: could we work around this issue by having a sealer count which is not dividable in partitions of same size? Something like N=3, N=5, N=7, N=11, ... ?

However, witth N=5 for instance, there seems to be a problem too, like dwalintukan posted on 11 Sep, but I am not sure if it is the same problem.

ivica7 on 11 Oct 2019

Hi,
I have the same issue, I created a Blockchain with 2 sealers, after mining the first block the sealers still just waiting for each other.
I'm using Geth Version: 1.9.6-stable:

Geth
Version: 1.9.6-stable
Git Commit: bd05968077f27f7eb083404dd8448157996a8788
Architecture: amd64
Protocol Versions: [63]
Network Id: 1
Go Version: go1.11.5
Operating System: linux
GOPATH=
GOROOT=/usr/lib/go-1.11

KebAz on 1 Nov 2019

i have the same issue. 1s block times and 4 sealers(not only 4 sealers, and now 5 is in the same problem). now is all sealers still just waiting for each others~

Geth
version:V1.9.7
commit a718daa674a2e23cb0c8a6789f7e5467e705bbbd
OS: linux red hat
go version go1.10.2 linux/amd64

now we have some ideas?@holiman @jmank88 @karalabe @fjl

fnaticwang on 20 Nov 2019

When i start the mining the node.node got stuck
INFO [01-27|10:26:23.850] Started P2P networking self=enode://84acace1e79f154df825955667061c79c37986a6594742999f3909fa7ea9a20b90d71cff58a66fc3ee9fa2f6be6d8f0bc455018be72532475d9405dd5cc79622@127.0.0.1:30311
INFO [01-27|10:26:24.784] Unlocked account address=0x1e8eD9837558819ffcBD8Fd20CE97976a4aB6D2f
INFO [01-27|10:26:24.785] Transaction pool price threshold updated price=1
INFO [01-27|10:26:24.785] Transaction pool price threshold updated price=1
INFO [01-27|10:26:24.785] Etherbase automatically configured address=0x1e8eD9837558819ffcBD8Fd20CE97976a4aB6D2f
INFO [01-27|10:26:24.785] Commit new mining work number=2 sealhash=3efe07…1cc8cf uncles=0 txs=0 gas=0 fees=0 elapsed=182.745µs
INFO [01-27|10:26:24.785] Signed recently, must wait for others

rajnishtech on 27 Jan 2020

@rajnishtech i dont think thats related with the issue described. That seems lika sealer waiting for another to sign. Maybe you forgot to add the other peers to that sealer? Or the other peers are not able to vote (clique.propose("0x...", true))

marcosmartinez7 on 28 Jan 2020

@marcosmartinez7 i am geeting the other issue from the other node 1
INFO [01-28|12:21:03.401] Commit new mining work number=1 sealhash=b36db4…2358e5 uncles=0 txs=0 gas=0 fees=0 elapsed=329.832µs

rajnishtech on 28 Jan 2020

@rajnishtech are you sure your two sealers have connectivity? What is the result of admin.peers on each node?

Btw i think you should create a new issue for your problem since it is not related with this issue in particular

marcosmartinez7 on 30 Jan 2020

@marcosmartinez7 have you found any ways to prevent this issue from popping up? Would increasing the block time prevent this issue?

LongJeongS on 6 Feb 2020

@LongJeongS this issue may be fixed by https://github.com/ethereum/go-ethereum/pull/19239

In my case, i stop using PoA 1 year ago, at that moment the solution was to use only 2 sealers.

If you have only 2, there is only one in turn and the other one is out of turn, so the deadlock cannot happen

But thats not a great architecture obviously..

marcosmartinez7 on 8 Feb 2020

Has this problem been solved? I have the exact same issue with 2 nodes, both sealers (limited resource setup). Even for contract deployment I get a hash and contract address but the contract is not present in chain. A restart of the chain fixes it though. After that all transactions timeout. Both nodes are connected (admin.peers shows 1 peer)

am0xf on 9 Mar 2020

Has this problem been solved? I have the exact same issue with 2 nodes, both sealers (limited resource setup). Even for contract deployment I get a hash and contract address but the contract is not present in chain. A restart of the chain fixes it though. After that all transactions timeout. Both nodes are connected (admin.peers shows 1 peer)

Are you sure that is the same problem? Because a restart should not fix it. The problem behind this issue makes a fork of the chain

marcosmartinez7 on 9 Mar 2020

A restart only makes the initially deployed contract available. Without a restart I get the error "...is chain synced / contract deployed correctly?" I haven't found a fix for further transactions timing out.

am0xf on 9 Mar 2020

Can we not run a PoA network with 1 sealer? Logs show "Signed recently. Must wait for others"

am0xf on 13 Mar 2020

Hello ,

In my private Clique network with 4 Nodes (A,B,C,D) I noticed a fork in the chain for block period 1.

I noticed that it happens some times with block period 1 & 2.

I noticed that the fork happened at block height 1500 for example. Nodes A & D have a similar chain data meanwhile Nodes B& C have similar data. (Fork occurence)

At block 1500, I noticed the difference in data between 2 chains: 1) Block hashes are different 2) Block of one chain fork is an uncle block while for other chain fork has 5000 txs included 3) Both blocks have same difficulty 2 which means that it was mined in turn as well as same sealer(complication) 4) Another complication arises when I noticed it was the same sealer who sealed the block.

This results in fork of the network and stalling at the end which cannot undergo any reorg in this deadlock situation.

In previous comments I noticed that there was atleast different difficulty and different sealers at the same block height between the forks

Please can some one let me know if you faced this issue or a logical explanation of this issue

cyrilnavessamuel on 2 Jun 2020

Do we have a solution for this PR.

We have also encountered the same issue for our network which had 5 signers and worked good for almost 2 months.
The block generation time was 1 sec.

It suddenly started to show message:
"INFO [09-05|08:50:16.267] Looking for peers peercount=4 tried=4 static=0".

We tried to start the mining by using the miner.start( ) function from all the miners/signers but it does not started to mine in the network and 3 of the nodes showed the response something like:

INFO [09-05|08:53:23.471] Commit new mining work number=7961999 sealhash="d93ccf…cdb147" uncles=0 txs=0 gas=0 fees=0 elapsed="94.336µs"
INFO [09-05|08:53:23.471] Signed recently, must wait for others INFO [09-05|08:57:23.483] Commit new mining work number=7961999 sealhash="c3f025…388121" uncles=0 txs=1 gas=21000 fees=0.0050589 elapsed="562.983µs"
INFO [09-05|08:57:23.484] Signed recently, must wait for others

and rest 2 showed showed the same response with number = 7961998.

The amazing thing was transactions were showing different in the txpool
2 nodes was showing 3 transactions in pending status.
2 nodes was showing 1 transactions in pending status.
1 nodes was showing 0 transactions in pending status.

Can anyone suggest what should I do that all nodes start mining again? I've tried a few steps and solutions but it did not help.

lmvirufan on 5 Sep 2020

Reviewed in team call: 5chdn suggestions are good. We could solve this by making out-of-turn difficulty more complicated. There should be some deterministic order to out-of-turn blocks. karalabe fears that this will introduce too much protocol complexity or large reorgs.

just came across this again. here's peter's comment on that matter: https://github.com/ethereum/EIPs/pull/2181

we drafted the EIPs 218{1,2,3} after EthCapeTown for consideration.

q9f on 5 Oct 2020

It seems the rinkeby network was stopped working for more than an hour 3 times in only one month:

time | block | stop time in minutes
-- | -- | --
12/28/2020, 09:28:25 AM | 7797794 | 771
12/02/2020, 07:04:35 AM | 7648430 | 101
11/30/2020, 02:47:51 PM | 7639067 | 64

Are rinkeby down times related to this issue?

abramsymons on 3 Jan 2021

To reproduce deadlock:

Run a network with 5 sealers and stop one of them to have 4 sealers active
Use 1 second as block time
Rebuild geth by setting wiggleTime from 500 milliseconds to 1 millisecond to increase racing conditions

With such a configuration, you should have 2-3 deadlocks each hour.

abramsymons on 3 Jan 2021

We experienced such deadlocks on IDChain and solved the issue by running a deadlock resolver script on all sealer nodes that monitor the node and if chain stopped, uses debug.setHead to return the node state to n/2+1 blocks ago where n is the number of sealers. The only disadvantage of using such approach to resolve the deadlock is that it increases the number of blocks required to wait for finality from n/2+1 to n/2+2.
This script uses eth rpc api to get the last blocks, clique to get number of signers to calculate n/2+1, debug to return the node state using debug.setHead and miner to restart miner after returning state.