Parity-ethereum: After upgrade to 1.11.8, validators in a PoA network miss blocks

Created on 9 Aug 2018  ·  24 Comments  ·  Source: openethereum/parity-ethereum

I'm running:

  • Which Parity version?: 1.11.8
  • Which operating system?: Linux
  • How installed?: via binaries
  • Are you fully synchronized?: yes
  • Which network are you connected to?: core
  • Did you try to restart the node?: yes

We run PoA-based networks (POA Core and POA Sokol). After upgrading to parity 1.11.8 we started to notice that validator nodes sometimes miss their turn and don't produce a block, which slows down the network. This seems to happen randomly to different validators.
We tried restarting parity and upgrading to a larger node (more CPU/RAM), but the issue still persists.

Could someone provide any guidance on how to troubleshoot/debug this to identify the root cause?

F2-bug 🐞 M4-core ⛓ P2-asap 🌊

All 24 comments

Is this the same or a similar error to https://github.com/paritytech/parity-ethereum/issues/9320?

@phahulin is this the same error as #9320? How do you see that "validator nodes would sometimes miss their turn"?

I don't know if it's the same error as #9320 or not.

Since we use the AuRa PoA engine, validators are expected to produce blocks in round-robin fashion, in our case every 5 seconds. What we observe is that they sometimes miss their turn, so that after the Nth validator the next block is mined 10 seconds later by the (N+2)th validator.
For example, take this block https://core-explorer.poa.network/block/4096058: it was mined at 2018-08-10T10:24:20 UTC; the previous block was correctly mined at 2018-08-10T10:24:15 (a 5-second gap), but the next one was mined at 2018-08-10T10:24:30 (a 10-second gap).
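To make the slot arithmetic concrete, here is a rough sketch (not code from Parity itself) of how AuRa maps wall-clock time to a proposer: the step is the UNIX timestamp divided by the step duration, and the proposer rotates by step modulo the validator count. The addresses below are placeholders, not the real POA Core set.

```python
# Rough sketch of AuRa slot assignment, not code from Parity itself.
STEP_DURATION = 5  # seconds, as configured on POA Core/Sokol

# Placeholder round-robin order; the real set has 21 validators.
validators = ["0xdb12...", "0xed2a...", "0x4ca0...", "0x5b2c...", "0x58ba..."]

def step_of(timestamp: int) -> int:
    # Each 5-second wall-clock window is one AuRa step.
    return timestamp // STEP_DURATION

def expected_proposer(timestamp: int) -> str:
    # The proposer rotates round-robin with the step number.
    return validators[step_of(timestamp) % len(validators)]

# Blocks stamped 10:24:20 and 10:24:30 are two steps apart, so exactly one
# slot (10:24:25) passed without a block, and its proposer missed a turn.
```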

From what we can see from netstats, all validators are online and have enough peers. Also, it doesn't happen each round.

> From what we can see from netstats, all validators are online and have enough peers.

Is there anything in the logs? You can probably figure out which validator failed to produce a block; can you check whether it's always the same one (very likely)? It could be a CPU/RAM/disk bottleneck.
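One way to spot the skipped slots is to walk recent blocks over JSON-RPC and flag timestamp gaps larger than one step; the authors on either side of a gap tell you whose turn was missed. A minimal sketch, assuming a local RPC endpoint at http://127.0.0.1:8545, a 5-second step, and an illustrative block range:

```python
import requests

RPC = "http://127.0.0.1:8545"  # assumed local Parity JSON-RPC endpoint
STEP = 5                       # step duration in seconds

def get_block(number: int) -> dict:
    payload = {"jsonrpc": "2.0", "id": 1,
               "method": "eth_getBlockByNumber",
               "params": [hex(number), False]}
    return requests.post(RPC, json=payload).json()["result"]

def find_gaps(start: int, end: int) -> None:
    prev = get_block(start)
    for n in range(start + 1, end + 1):
        block = get_block(n)
        gap = int(block["timestamp"], 16) - int(prev["timestamp"], 16)
        if gap > STEP:
            # gap // STEP - 1 slots passed without a block between these two.
            print(f"#{n}: {gap}s gap, {prev['miner']} -> {block['miner']}")
        prev = block

find_gaps(4096050, 4096070)  # illustrative range around the example block
```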

No. Are there any specific logs (logging="???") we should be looking at?
In our Core network what we see is that 15 of 21 validators sometimes miss blocks. They don't miss them regularly, but the combined effect is very noticeable.

Using a validator set contract, you can report the authorities that miss a block. This could be a good start. I'm not sure the -l poa flag will help, but you can give it a try (see the list of available logs).

We use on-chain governance to manage the validator list. But it won't be a good solution anyway, since almost all of the validators would get banned.

Reporting isn't banning. You need to monitor your network in some way (with your block explorer, for instance) and find out who misses their slot.

Yeah, we do monitor it. The problem is that about 15 of 21 validators started to miss their blocks after the upgrade.

@phahulin Could you grab logs for -lengine=trace? Although it probably won't help much. Can you relate the skipped authorities to any kind of CPU load on the node? Does it only happen on 1.11.8? Do you know if this also happens with 2.0.1?

OK, will do. But it will take some time because we don't have direct access to the validators' nodes, so I'll have to ask them. Please don't close this issue yet.

@andresilva I got logs from one validator from our test network (sokol) taken with logging="engine=trace,poa=trace,miner=trace,reorg=trace,own_tx=trace,perf=trace,stats=trace".
Here is a piece of the log around the block that this node didn't produce. I formatted it a little bit and excluded the perf part, because it's very verbose but doesn't seem to be very informative. I can add it if you need it.

A bit of context:

  1. This validator is 0x5b2c...; it didn't produce a block in this round. The validator before it didn't produce its block either.
  2. Our step duration is 5 seconds, so there are 5 time slots to produce blocks in this piece of the log, but only 3 blocks were produced:

| Time | Step | Expected Val | Actual Block N | Block hash |
| --------------- | ------ | ----------------- | ------------------------------------------------------------ | --------------- |
| 00:03:30 | ...282 | 0xdb12... | 4033697 | 0x194b... |
| 00:03:35 | ...283 | 0xeD2A... | 4033698 | 0xe2a3... |
| 00:03:40 | ...284 | 0x4Ca0... | MISSED | - |
| 00:03:45 | ...285 | 0x5B2c... | MISSED | - |
| 00:03:50 | ...286 | 0x58ba... | 4033699 | 0x9e8a... |

From what I see, something strange happens with block 4033698: although it corresponds to 00:03:35, the message that it was imported only appears at 00:03:40. I can also add that I checked a few other nodes, and on all of them this block was imported at 00:03:40.

Then there is a message that validator 0x4Ca0... produced its block with a step gap (so it did produce it!), and then a reorg happens to 4033698 (how can reorgs happen in AuRa, by the way?).
The message about "Fetching proposer for step ...285" (the step in which this validator should have produced a block) only appears at 00:03:50, while it should have appeared at 00:03:45.

I asked the operator to check whether the time is in sync on that node; this is the best guess I have so far. It doesn't explain, though, why the node isn't _constantly_ missing blocks, or when 4033698 was actually produced.

Indeed, something strange happens at block #4033698; it is actually imported with a different hash first:

2018-08-15 00:03:40 UTC Verifier #1 INFO import Imported #4033698 0x5bbd…99a2 (2 txs, 0.04 Mgas, 1 ms, 0.78 KiB)

Re-orgs in AuRa can happen if there's any kind of "partition" in how the nodes see the global state; that could be:

  • Network partitions, where nodes will only be able to communicate with a subset of authorities
  • Dishonest nodes that mine on more than one chain
  • Unsynchronized clocks

I would try to check that all the clocks are synchronized and that there isn't more than one authority online using the same key. Also, what version were you running before the upgrade? Just to see if I can try to relate this to any change that we might have done.
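For the clock check, a quick way to measure local drift is to query an NTP server. A minimal sketch, assuming the third-party ntplib package (`pip install ntplib`):

```python
import ntplib  # third-party package, assumed installed via `pip install ntplib`

# With 5-second AuRa steps, even a drift of a second or two can push a
# validator's view of the current step out of line with its peers.
response = ntplib.NTPClient().request("pool.ntp.org", version=3)
print(f"local clock offset: {response.offset:+.3f} s")
if abs(response.offset) > 1.0:
    print("drift is large enough to matter for 5-second steps")
```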

The previous parity version was 1.10.6

@andresilva, you mention three reasons that a "re-org" can happen. Could you give us a clue where in the code these decisions are being made, please?
Also, we are collecting logs that are very verbose, using
logging="engine=trace,poa=trace,miner=trace,reorg=trace,own_tx=trace,perf=trace,stats=trace"
Which ones can we eliminate while debugging these three scenarios?

Hi @andresilva

I got logs from the validator node (0xed2a…1c36) that produced weird blocks.
Here is the piece of the log that I think corresponds to the scenario above. This piece of log is, again, without the perf logs.

However, in this case, there are tons of messages from perf, similar to:

2018-08-18 15:17:07 UTC IO Worker #3 TRACE perf  queue_ancient_block: 0.04ms

Actually, about 96% of the log file consists of these messages. Also, this is weird:

$ grep 'import      #' 0xed2a.log  | head -n15
2018-08-18 15:17:23 UTC IO Worker #0 INFO import      #50799   26/50 peers      4 MiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  3 req/s, 744 µs
2018-08-18 15:17:59 UTC IO Worker #2 INFO import      #64261   26/50 peers      5 MiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  3 req/s, 565 µs
2018-08-18 15:18:33 UTC IO Worker #1 INFO import      #20193   26/50 peers      5 MiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  4 req/s, 719 µs
2018-08-18 15:19:08 UTC IO Worker #1 INFO import      #22098   26/50 peers      5 MiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  2 req/s, 513 µs
2018-08-18 15:19:38 UTC IO Worker #1 INFO import      #22098   26/50 peers      5 MiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  3 req/s, 529 µs
2018-08-18 15:20:13 UTC IO Worker #0 INFO import      #22098   26/50 peers      9 MiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  2 req/s, 507 µs
2018-08-18 15:20:43 UTC IO Worker #3 INFO import      #47483   26/50 peers      6 KiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  3 req/s, 2911 µs
2018-08-18 15:21:18 UTC IO Worker #1 INFO import      #55436   26/50 peers     78 KiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  3 req/s, 547 µs
2018-08-18 15:21:48 UTC IO Worker #1 INFO import      #55436   26/50 peers     79 KiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  3 req/s, 524 µs
2018-08-18 15:22:18 UTC IO Worker #2 INFO import      #55436   26/50 peers    173 KiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  3 req/s, 542 µs
2018-08-18 15:22:53 UTC IO Worker #3 INFO import      #48640   26/50 peers    176 KiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  3 req/s, 23245 µs
2018-08-18 15:23:28 UTC IO Worker #2 INFO import      #39367   26/50 peers    198 KiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  3 req/s, 67774 µs
2018-08-18 15:24:03 UTC IO Worker #1 INFO import      #30719   26/50 peers    199 KiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  4 req/s, 82361 µs
2018-08-18 15:24:39 UTC IO Worker #1 INFO import      #57897   26/50 peers    201 KiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  2 req/s, 42810 µs
2018-08-18 15:25:13 UTC IO Worker #2 INFO import      #40736   26/50 peers    202 KiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  2 req/s, 388 µs

Ancient block verification never completes, so this whole problem is probably related to another issue.
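For what it's worth, the backtracking is easy to confirm mechanically: pull the block numbers out of the periodic `import      #N` status lines and check whether they ever climb toward the chain head. A throwaway sketch over the same log file as in the grep above:

```python
import re

# Extract block numbers from the periodic "import      #N" status lines.
numbers = []
with open("0xed2a.log") as log:
    for line in log:
        match = re.search(r"import\s+#(\d+)", line)
        if match:
            numbers.append(int(match.group(1)))

if numbers:
    print("min:", min(numbers), "max:", max(numbers))
    # On a healthy node these climb monotonically; here they keep jumping back.
    print("monotonic:", all(a <= b for a, b in zip(numbers, numbers[1:])))
```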

The expected order of validators is:

| Time | Step | Validator | Actual Block |
| -------- | --------- | ------------- | ----------------------------------------------------------- |
| 15:23:05 | 306921157 | 0xdb12…ea85 | 4093446 |
| 15:23:10 | 306921158 | 0xed2a…1c36 | 4093447 |
| 15:23:15 | 306921159 | 0x4ca0…46a0 | Discarded fork 4093447 ? |
| 15:23:20 | 306921160 | 0x5b2c…11e5 | Missed |
| 15:23:25 | 306921161 | 0x58ba…34cd | 4093448 |

The way I see it, 0xed2a…1c36 starts to produce a new block at the correct time:

2018-08-18 15:23:10 UTC IO Worker #3 TRACE engine  Fetched proposer for step 306921158: 0xed2a…1c36

but it doesn't complete on time (probably because the node is busy constantly verifying ancient blocks).

As a consequence, at 15:23:16 it receives a new block #4093447, built by the next validator 0x4ca0…46a0 with a step gap:

2018-08-18 15:23:16 UTC Verifier #0 DEBUG engine  Author 0x4ca0…46a0 built block with step gap. current step: 306921159, parent step: 306921157

Then, finally, it produces its own version of block #4093447:

2018-08-18 15:23:17 UTC IO Worker #3 TRACE engine  generate_seal: Issuing a block for step 306921158.
2018-08-18 15:23:17 UTC IO Worker #3 TRACE miner  Received a Regular seal.

which causes the network to abandon the previous version of #4093447 built by 0x4ca0…46a0 and adopt the one from 0xed2a…1c36:

2018-08-18 15:23:17 UTC IO Worker #3 INFO reorg  Reorg to #4093447 0x3105…8229 (0xcb66…4903 #4093446 0xcbeb…269f )
2018-08-18 15:23:17 UTC IO Worker #3 TRACE engine  Multi ValidatorSet retrieved for block 509355.
2018-08-18 15:23:17 UTC IO Worker #3 DEBUG engine  Zooming to epoch for block 0x3105…8229

This effectively makes validator 0x4ca0…46a0 miss a block (is this a bug? shouldn't the network abandon the "late" version of #4093447?).

bumping priority here.

@wkarshat Only double votes are handled explicitly (when using the Validator Set here); everything else is handled implicitly by chain difficulty.

@phahulin I think your assessment is correct: ancient block import is blocking the import of new blocks. Both processes require a write lock, at least on the database; I'll see if we can reduce the contention. I still don't understand why it keeps backtracking. Could you also try fetching trace logs for network and sync? It will be a lot of logs, but you can just dump them here (or contact me on Gitter/Riot) and I'll filter them on my side. I'm sorry, but I can't replicate this one easily since I can't seal blocks on the POA chain ;).

The reason the chain re-orgs to the "late" block is that it was mined by the expected validator (no steps were missed); the chain difficulty formula gives a higher score to chains with fewer missed steps. If in the meantime a block #4093448 had been built on top of the "old" #4093447, then it would have been kept and no re-org would have happened.
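To illustrate the scoring intuition with a simplified model (this is not the exact formula from the Parity source): each block contributes a large constant minus the number of steps skipped since its parent, so in a head-to-head comparison the block sealed in its own slot outscores the block sealed with a step gap, unless someone has already built on top of the latter.

```python
# Simplified model of AuRa chain scoring, not the exact Parity formula.
BASE = 2**64  # stand-in for the large per-block constant

def block_score(parent_step: int, current_step: int) -> int:
    # Fewer skipped steps since the parent => higher score for this block.
    return BASE - (current_step - parent_step)

parent = 306921157                       # step of block #4093446
late   = block_score(parent, 306921158)  # 0xed2a...'s late block, its own slot
gapped = block_score(parent, 306921159)  # 0x4ca0...'s block, one step skipped

assert late > gapped  # the late #4093447 wins, hence the re-org
# But if a #4093448 had already been built on the gapped block, that two-block
# chain's total score would exceed the single late block, and no re-org occurs.
```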

@andresilva I've sent you some logs in gitter, I'll try to gather some more and send it there too.

Hi @andresilva, I've sent you a link on Gitter to logs with sync and network tracing enabled. They cover about 6 hours, and syncing restarts from block 0 44 times during that period.

We're also seeing this issue on a chain configured similarly to poa.network (AuRa with a 5-second block time): https://explorer.sigma1.artis.network/reorgs
Most nodes run parity 2.1.6-stable.

Has this been fixed yet?
