This is happening on multiple servers and seems to be completely random. After syncing for a while, sometimes a few days, there will be a massive spike in CPU and disk usage (https://i.imgur.com/7ichLLG.png) and Parity just gets stuck. This can last for around 6-12 hours and then it seems to fix itself, or if I catch it early enough, a quick restart of the instance fixes it. The only other thing running on these instances is nginx. Here is an example of the Parity logs when it starts:
2019-04-23 20:53:53 UTC Imported #7626040 0xa0be…787e (50 txs, 8.00 Mgas, 271 ms, 7.78 KiB)
2019-04-23 20:53:58 UTC Imported #7626041 0xa7cf…7d53 (88 txs, 7.69 Mgas, 441 ms, 15.62 KiB)
2019-04-23 20:54:06 UTC Imported #7626042 0x4b0f…61f2 (86 txs, 3.74 Mgas, 315 ms, 15.14 KiB)
2019-04-23 20:54:12 UTC Imported #7626043 0x0065…0144 (3 txs, 0.07 Mgas, 44 ms, 1.10 KiB)
2019-04-23 20:54:16 UTC 25/25 peers 136 MiB chain 389 MiB db 0 bytes queue 43 KiB sync RPC: 2 conn, 11 req/s, 3565 µs
2019-04-23 20:56:19 UTC 24/25 peers 136 MiB chain 389 MiB db 68 KiB queue 43 KiB sync RPC: 2 conn, 2 req/s, 272789 µs
2019-04-23 20:56:19 UTC 24/25 peers 136 MiB chain 389 MiB db 68 KiB queue 43 KiB sync RPC: 2 conn, 2 req/s, 272789 µs
2019-04-23 20:56:20 UTC Syncing #7626043 0x0065…0144 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 2+ 1 Qed #7626048 24/25 peers 136 MiB c$
2019-04-23 20:56:24 UTC Syncing #7626043 0x0065…0144 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 2+ 1 Qed #7626048 24/25 peers 136 MiB c$
2019-04-23 20:56:48 UTC Syncing #7626043 0x0065…0144 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 3+ 1 Qed #7626050 21/25 peers 136 MiB c$
2019-04-23 20:59:24 UTC Syncing #7626043 0x0065…0144 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 8+ 1 Qed #7626054 0/25 peers 136 MiB c$
2019-04-23 21:00:08 UTC Syncing #7626043 0x0065…0144 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 8+ 1 Qed #7626054 0/25 peers 136 MiB c$
2019-04-23 21:00:52 UTC Syncing #7626043 0x0065…0144 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 8+ 1 Qed #7626054 0/25 peers 136 MiB c$
2019-04-23 21:01:16 UTC Syncing #7626043 0x0065…0144 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 8+ 1 Qed #7626054 0/25 peers 136 MiB c$
2019-04-23 21:01:19 UTC Syncing #7626043 0x0065…0144 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 8+ 1 Qed #7626054 0/25 peers 136 MiB c$
2019-04-23 21:01:21 UTC Syncing #7626043 0x0065…0144 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 8+ 1 Qed #7626054 0/25 peers 136 MiB c$
2019-04-23 21:01:35 UTC Syncing #7626043 0x0065…0144 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 8+ 1 Qed #7626054 0/25 peers 136 MiB c$
2019-04-23 21:01:48 UTC Syncing #7626043 0x0065…0144 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 8+ 1 Qed #7626054 0/25 peers 136 MiB c$
2019-04-23 21:01:48 UTC Syncing #7626043 0x0065…0144 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 8+ 1 Qed #7626054 0/25 peers 136 MiB c$
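Since a quick restart clears the stall when it is caught early, one lightweight mitigation is a watchdog that polls the node over JSON-RPC and restarts it once the chain head stops advancing. Below is a minimal sketch in Python; the RPC endpoint, polling interval, stall threshold, and restart command are all assumptions about the deployment rather than details from this report.

#!/usr/bin/env python3
"""Stall-watchdog sketch: poll eth_blockNumber over JSON-RPC and restart the
node if the chain head stops advancing. Endpoint, interval, threshold and
restart command are assumptions -- adjust them to the actual deployment."""
import json
import subprocess
import time
import urllib.request

RPC_URL = "http://127.0.0.1:8545"                  # assumed local JSON-RPC endpoint
CHECK_EVERY = 60                                   # seconds between polls
STALL_AFTER = 600                                  # treat 10 min without a new block as stuck
RESTART_CMD = ["systemctl", "restart", "parity"]   # hypothetical service name

def block_number() -> int:
    """Return the node's current head block number."""
    payload = json.dumps({"jsonrpc": "2.0", "method": "eth_blockNumber",
                          "params": [], "id": 1}).encode()
    req = urllib.request.Request(RPC_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return int(json.load(resp)["result"], 16)

def main() -> None:
    last_block = block_number()
    last_change = time.monotonic()
    while True:
        time.sleep(CHECK_EVERY)
        try:
            current = block_number()
        except Exception as exc:   # the RPC itself can get slow while the node is stuck
            print(f"RPC error: {exc}")
            continue
        if current > last_block:
            last_block, last_change = current, time.monotonic()
        elif time.monotonic() - last_change > STALL_AFTER:
            print(f"head stuck at #{last_block}, restarting node")
            subprocess.run(RESTART_CMD, check=False)
            last_change = time.monotonic()

if __name__ == "__main__":
    main()

Note from the logs above that the reported RPC latency jumps from a few thousand µs to 272789 µs in the stuck state, so the watchdog keys on the head block number rather than on RPC responsiveness alone.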
I seem to have a similar issue myself with versions >2.2.9, using an archive node pointed at Kovan. It syncs correctly, switches to importing, and is fine for a while; then it drops back to syncing and gets stuck in this "0.00 blk/s 0.0 tx/s 0.0 Mgas/s" state.
I have seen the issue with 2.3.9 and 2.4.5.
I have to drop back to 2.2.9 to get it working again.
Edit: My node is on a Windows 10 machine.
I can also confirm this behavior on v2.2.9, v2.4.5 and v2.4.6 on Linux, running multiple full nodes with tracing on mainnet.
It seems to happen at random blocks, and I've never seen two nodes get stuck on the same block. I've also never seen it fix itself; only a restart fixes it, and not always. I've had cases where a node didn't manage to continue syncing for 6-7 restarts.
I've ruled out an upgrade as the cause, because we've had nodes running on the same version for a long time and this only started happening recently.
Example of logs:
2019-05-22 10:01:09 UTC Syncing #7807258 0x94e7…cc80 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 0+ 0 Qed #286 5/100 peers 5 MiB chain 0 bytes db 0 bytes queue 759 KiB sync RPC: 0 conn, 4 req/s, 67 µs
2019-05-22 10:01:14 UTC Syncing #7807258 0x94e7…cc80 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 0+ 0 Qed #747 5/100 peers 5 MiB chain 0 bytes db 0 bytes queue 763 KiB sync RPC: 0 conn, 1 req/s, 3700 µs
2019-05-22 10:01:18 UTC Syncing #7807258 0x94e7…cc80 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 0+ 0 Qed #1144 5/100 peers 5 MiB chain 0 bytes db 0 bytes queue 1 MiB sync RPC: 0 conn, 2 req/s, 98904 µs
2019-05-22 10:01:23 UTC Syncing #7807258 0x94e7…cc80 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 0+ 0 Qed #1527 5/100 peers 5 MiB chain 0 bytes db 0 bytes queue 1 MiB sync RPC: 0 conn, 1 req/s, 133653 µs
2019-05-22 10:01:30 UTC Syncing #7807258 0x94e7…cc80 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 0+ 0 Qed #1906 5/100 peers 5 MiB chain 0 bytes db 0 bytes queue 1 MiB sync RPC: 0 conn, 3 req/s, 98904 µs
2019-05-22 10:01:34 UTC Syncing #7807258 0x94e7…cc80 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 0+ 0 Qed #2166 4/100 peers 5 MiB chain 0 bytes db 0 bytes queue 1 MiB sync RPC: 0 conn, 3 req/s, 99861 µs
2019-05-22 10:01:41 UTC Syncing #7807258 0x94e7…cc80 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 0+ 0 Qed #2670 4/100 peers 5 MiB chain 0 bytes db 0 bytes queue 1 MiB sync RPC: 0 conn, 3 req/s, 99861 µs
2019-05-22 10:01:44 UTC Syncing #7807258 0x94e7…cc80 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 0+ 0 Qed #3175 3/100 peers 5 MiB chain 0 bytes db 0 bytes queue 1 MiB sync RPC: 0 conn, 3 req/s, 99861 µs
2019-05-22 10:01:49 UTC Syncing #7807258 0x94e7…cc80 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 0+ 0 Qed #3810 3/100 peers 5 MiB chain 0 bytes db 0 bytes queue 2 MiB sync RPC: 0 conn, 2 req/s, 141991 µs
2019-05-22 10:01:53 UTC Syncing #7807258 0x94e7…cc80 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 0+ 0 Qed #4832 3/100 peers 5 MiB chain 0 bytes db 0 bytes queue 1 MiB sync RPC: 0 conn, 2 req/s, 183841 µs
2019-05-22 10:01:58 UTC Syncing #7807258 0x94e7…cc80 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 0+ 0 Qed #5850 3/100 peers 5 MiB chain 0 bytes db 0 bytes queue 1 MiB sync RPC: 0 conn, 2 req/s, 183841 µs
2019-05-22 10:02:03 UTC Syncing #7807258 0x94e7…cc80 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 0+ 0 Qed #6604 3/100 peers 5 MiB chain 0 bytes db 0 bytes queue 2 MiB sync RPC: 0 conn, 3 req/s, 113243 µs
Can you share the cache and pruning settings you're using on your node? What you're describing sounds like the cache being flushed to disk.
@joshua-mir Sure! Here's the configuration:
--auto-update=none
--mode=active
--tracing=on
--pruning=archive
--db-compaction=ssd
--scale-verifiers
--num-verifiers=6
--jsonrpc-server-threads=5
--jsonrpc-threads=5
--cache-size=22000
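For completeness, the same settings can also live in a config file passed with --config instead of individual flags. The TOML below is a sketch of that mapping; the section and key names are my assumption based on Parity's config file format and should be checked against the config reference for the version you run.

# Sketch of the flags above as a Parity config.toml (section/key names unverified)
[parity]
mode = "active"
auto_update = "none"

[rpc]
server_threads = 5
processing_threads = 5

[footprint]
tracing = "on"
pruning = "archive"
db_compaction = "ssd"
scale_verifiers = true
num_verifiers = 6
cache_size = 22000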
Yeah, that should be more than sufficient for what you're trying to do. (Minor note: --scale-verifiers overrides --num-verifiers.)
I can confirm what AC0DEM0NK3Y said about it affecting versions >2.2.9.
I downgraded to 2.2.9 and haven't had any issues since (27 days).
I'm now in the tricky position of 2.2.9 seemingly not being usable on Kovan because of
Error: Error(Engine(NotProposer(Mismatch { expected: 0x00d6cc1ba9cf89bd2e58009741f4f7325badc0ed, found: 0xfaadface3fbd81ce37b0e19c0b65ff4234148132 })), State { next_error: None, backtrace: InternalBacktrace { backtrace: None } })
since I don't seem to be able to pull data from peers on that version (I tried deleting the nodes file). The above error seems to be fixed in 2.3.6 (which takes it to a stage 1 block error, TimestampOverflow), but that means I'm back to getting stuck on this particular sync issue.
@AC0DEM0NK3Y Syncing with Kovan won't work with these versions unless you use the updated chain specification file from master; there was a hard fork on Kovan whose support was introduced in 2.4.6/2.5.1.
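For reference, --chain accepts a path to a chain specification JSON as well as a preset name, so an updated Kovan spec taken from master can in principle be pointed at directly, in the same style as the flags above:

--chain=/path/to/kovan.json

The path is a placeholder, and there is no guarantee an older binary understands every field of a newer spec, so upgrading to 2.4.6/2.5.1 or later is the more reliable fix.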
Closing issue due to its stale state.