Go-ethereum: downloader: Lock-up during sync due to circular return logic

Created on 20 Apr 2018 · 15 Comments · Source: ethereum/go-ethereum

EDIT: Original title: Lock-up during initial sync

EDIT: See this comment for current best guess on cause.


System information

Geth version: v1.8.4-stable-2423ae01/linux-amd64/go1.10 (installed via Ubuntu PPA package)
OS & Version: Ubuntu 16.04.4 LTS (Xenial Xerus)
Machine: KVM VPS

% uname -a
Linux <hostname> 4.4.0-119-generic #143-Ubuntu SMP Mon Apr 2 16:08:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Expected behaviour

Continuous fast-sync.

Running via systemd with:

/usr/bin/geth --pprof --metrics --datadir /home/geth/.ethereum --cache 4096 --txpool.pricelimit 31337000 --syncmode fast --ethstats "veox-geth-lightserv-new-RESYNC:[email protected]"

Actual behaviour

After seemingly-normal operation, and dropping off "stalling" peers once in a while, non-debug log output stops, as shown in this log tail.

At this point, in console using geth attach:

> eth.syncing
{
  currentBlock: 4460991,
  highestBlock: 5402762,
  knownStates: 15834634,
  pulledStates: 15823975,
  startingBlock: 4327715
}
> admin.peers.length
25

Setting

> debug.vmodule("p2p=4,downloader=4")

results in

DEBUG[04-20|15:06:23] Recalculated downloader QoS values       rtt=5.195478857s confidence=1.000 ttl=15.586452156s

being printed repeatedly (the timestamp changes, as expected; the rtt/ttl values don't).

Forcibly disconnecting a peer using admin.removePeer("<enode>") works, and a new peer is selected from the pool. In other words, p2p still works fine(-ish?).

Steps to reproduce the behaviour

Not sure; this is possibly related to networking conditions on the machine.

Happens anywhere between 5 minutes and 1 hour after starting the node.

Rambling

If I had to hazard a guess, I'd say the node corners itself into selecting peers so fast that a small traffic spike on the VPS tower makes them all look just slow enough to be dropped.

After that, either the downloader fails to realise the sync-peers are no longer there; QoS fails at hysteresis; all remaining peers are malicious; or something of the sort.

Backtrace

See gist.

Most helpful comment

I need to roll out a production app very soon and need a stable version of geth. (...) I would like to know if anyone recommends a specific stable version (...)?

@GeeeCoin If @reductionista has correctly identified the cause, rolling back one MINOR release (to v1.8.3) should work. I've rolled back a different server to v1.8.0 and it seems stable; but that has already fully synced previously.

All 15 comments

I see that on every restart after such a failure, between the "shutting down" message and the actual shutdown, there's this error:

% journalctl --output=cat -b -u geth | grep ERR
ERROR[04-20|13:37:34] Failed to unregister sync peer           peer=d53f364380e3b4a1 err="peer is not registered"
ERROR[04-20|13:40:30] Failed to unregister sync peer           peer=543206370b63c847 err="peer is not registered"
ERROR[04-20|15:08:48] Failed to unregister sync peer           peer=f93f7593eb6bf220 err="peer is not registered"
ERROR[04-20|15:46:21] Failed to unregister sync peer           peer=5e1b61e2ac555a7a err="peer is not registered"

(Now running with debug.verbosity(4), because I don't know what modules I should be actually looking at.)


Grepping for those peer IDs (with some TRACEs from previous runs, when I was fiddling with the logger):

% journalctl --output=cat -b -u geth | grep -E "(d53f364380e3b4a1|543206370b63c847|f93f7593eb6bf220|5e1b61e2ac555a7a)"
ERROR[04-20|13:37:34] Failed to unregister sync peer           peer=d53f364380e3b4a1 err="peer is not registered"
TRACE[04-20|13:38:42] Received node data response              peer=543206370b63c847 count=300     dropped=false timeout=false
TRACE[04-20|13:38:42] Peer throughput measurements updated     peer=543206370b63c847 hps=13.076   bps=1.883  rps=0.607   sps=16.636 miss=0 rtt=8.817892115s
TRACE[04-20|13:38:42] Delivered new batch of data              peer=543206370b63c847 type=receipts  count=6
TRACE[04-20|13:38:42] Peer throughput measurements updated     peer=543206370b63c847 hps=18.804   bps=1.883  rps=0.607   sps=16.636 miss=0 rtt=8.209003701s
TRACE[04-20|13:38:42] Delivered new batch of data              peer=543206370b63c847 type=headers   count=192
TRACE[04-20|13:38:42] Requesting new batch of data             peer=543206370b63c847 type=headers   from=4207467
TRACE[04-20|13:38:43] Peer throughput measurements updated     peer=543206370b63c847 hps=18.804   bps=1.883  rps=0.607   sps=24.730 miss=0 rtt=7.69555938s
TRACE[04-20|13:38:43] Requesting new batch of data             peer=543206370b63c847 type=state     count=384
TRACE[04-20|13:38:43] Peer throughput measurements updated     peer=543206370b63c847 hps=18.804   bps=3.104  rps=0.607   sps=24.730 miss=0 rtt=7.167196521s
TRACE[04-20|13:38:43] Delivered new batch of data              peer=543206370b63c847 type=bodies    count=34:34
DEBUG[04-20|13:40:12] Stalling delivery, dropping              peer=543206370b63c847 type=receipts
DEBUG[04-20|13:40:12] Header download terminated               peer=543206370b63c847
ERROR[04-20|13:40:30] Failed to unregister sync peer           peer=543206370b63c847 err="peer is not registered"
DEBUG[04-20|13:40:30] Removing p2p peer                        id=543206370b63c847 conn=inbound duration=2m23.656s peers=20 req=false err="read tcp <my-ip-snipped>:30303-><their-ip-snipped>:43252: i/o timeout"
WARN [04-20|13:46:54] Stalling state sync, dropping peer       peer=5e1b61e2ac555a7a
DEBUG[04-20|13:46:54] Removing p2p peer                        id=5e1b61e2ac555a7a conn=inbound duration=13.707s   peers=15 req=false err="useless peer"
WARN [04-20|14:11:05] Stalling state sync, dropping peer       peer=f93f7593eb6bf220
DEBUG[04-20|15:08:48] Removing static node                     node=enode://f93f7593eb6bf2209ed1486e9b15b081a7085d7152dcfa36b62ae3ddcdda7b03c3c51ad8898a9a453e48dfad46d112735590ea9b5cb68a057fbb7b0a448a7015
ERROR[04-20|15:08:48] Failed to unregister sync peer           peer=f93f7593eb6bf220 err="peer is not registered"
DEBUG[04-20|15:08:48] Removing p2p peer                        id=f93f7593eb6bf220 conn=inbound duration=1h12m29.948s peers=24 req=false err="disconnect requested"
DEBUG[04-20|15:27:07] Unrequested node data                    peer=5e1b61e2ac555a7a len=2
DEBUG[04-20|15:27:07] Unrequested node data                    peer=5e1b61e2ac555a7a len=40
DEBUG[04-20|15:33:06] Header request timed out                 peer=5e1b61e2ac555a7a elapsed=6.000006s
ERROR[04-20|15:46:21] Failed to unregister sync peer           peer=5e1b61e2ac555a7a err="peer is not registered"
DEBUG[04-20|15:46:21] Removing p2p peer                        id=5e1b61e2ac555a7a conn=inbound duration=21m39.577s peers=6  req=false err="client quitting"

(Suggestions for egrep patterns welcome.)

That "Removing static node" message is from when I tried to manually remove the peer using admin.removePeer(<enode>), as described in the original report.

Prior to every lock-up, there is also this:

% journalctl --output=cat -b -u geth | grep Rolled
WARN [04-20|12:48:38] Rolled back headers                      count=2048 header=4222970->4220922 fast=4185744->4185744 block=0->0
WARN [04-20|13:40:12] Rolled back headers                      count=2048    header=4234602->4232554 fast=4197105->4197105 block=0->0
WARN [04-20|14:11:13] Rolled back headers                      count=2048 header=4497699->4495651 fast=4459957->4459957 block=0->0
WARN [04-20|15:33:07] Rolled back headers                      count=2048 header=4635071->4633023 fast=4598545->4598545 block=0->0
WARN [04-20|16:14:56] Rolled back headers                      count=2048 header=4860764->4858716 fast=4838505->4838505 block=0->0
WARN [04-20|16:43:17] Rolled back headers                      count=2048 header=4939473->4937425 fast=4902319->4902319 block=0->0
WARN [04-20|16:56:02] Rolled back headers                      count=2048 header=4941607->4939559 fast=4905619->4905619 block=0->0

The issue seems to disappear after the _block_ sync is done; the node is currently mostly pulling in _state_, and hasn't experienced a lock-up (therefore hasn't been restarted) for over an hour.


EDIT: Untrue; they become more rare, but they don't disappear completely.

Same issue here:
instance: Geth/v1.8.4-stable-2423ae01/linux-amd64/go1.10
cmdline: ./geth-1.8.4 --fast --cache 4096 --rpc --rpcaddr xxx.xxx.xxx.xxx --rpcvhosts * --nat extip:xxx.xxx.xxx.xxx --port 30304 --lightserv 1 --lightpeers 2 --rpcapi admin,db,eth,debug,miner,net,shh,txpool,personal,web3 console

The 1st time it happened after a "Rolled back headers" message; the 2nd time:
INFO [04-20|21:43:02] Imported new block receipts count=679 elapsed=5.380s number=4786290 hash=716762…4a7038 size=61.98mB ignored=0
INFO [04-20|21:43:05] Imported new state entries count=1490 elapsed=5.710ms processed=13852964 pending=49488 retry=4 duplicate=2108 unexpected=6978
INFO [04-20|21:43:06] Imported new state entries count=1906 elapsed=11.368ms processed=13854870 pending=47850 retry=3 duplicate=2108 unexpected=6978
INFO [04-20|21:43:06] Imported new block receipts count=599 elapsed=4.256s number=4786889 hash=422d7f…04f93d size=53.36mB ignored=0
INFO [04-20|21:43:06] Imported new state entries count=1853 elapsed=10.791ms processed=13856723 pending=46175 retry=0 duplicate=2108 unexpected=6978
INFO [04-20|21:43:09] Imported new block receipts count=366 elapsed=2.297s number=4787255 hash=f8027a…fbdf2e size=31.49mB ignored=0
INFO [04-20|21:43:09] Imported new state entries count=1268 elapsed=5.896ms processed=13857991 pending=47443 retry=3 duplicate=2108 unexpected=6978
INFO [04-20|21:43:12] Imported new block receipts count=541 elapsed=2.908s number=4787796 hash=a3b931…406f54 size=44.17mB ignored=0
INFO [04-20|21:43:12] Imported new block receipts count=146 elapsed=679.101ms number=4787942 hash=0fbd56…fe5506 size=11.48mB ignored=0
INFO [04-20|21:43:13] Imported new block receipts count=24 elapsed=182.224ms number=4787966 hash=45d1f9…f74e67 size=1.71mB ignored=0
INFO [04-20|21:43:14] Imported new state entries count=1505 elapsed=15.724ms processed=13859496 pending=47645 retry=5 duplicate=2108 unexpected=6978
INFO [04-20|21:43:14] Imported new state entries count=1996 elapsed=11.544ms processed=13861492 pending=45424 retry=68 duplicate=2108 unexpected=6978

web3.eth.syncing
{
currentBlock: 4787966,
highestBlock: 5475341,
knownStates: 13906916,
pulledStates: 13861492,
startingBlock: 1398556
}

When stalled, CPU usage drops from ~200% to ~60%; disk I/O stays at ~40 MB/s read and ~40 MB/s write.

Once again:
INFO [04-20|22:27:57] Imported new block receipts count=427 elapsed=5.318s number=4875235 hash=8d2951…e1cbb7 size=63.80mB ignored=0
INFO [04-20|22:28:00] Imported new state entries count=411 elapsed=1.502ms processed=14691362 pending=54926 retry=0 duplicate=35 unexpected=162
INFO [04-20|22:28:03] Imported new block headers count=2048 elapsed=3.958s number=4912894 hash=2f370f…4c9606 ignored=0
WARN [04-20|22:28:04] Rolled back headers count=2048 header=4912894->4910846 fast=4875235->4875235 block=0->0
INFO [04-20|22:28:04] Imported new state entries count=1446 elapsed=5.436ms processed=14692808 pending=54292 retry=0 duplicate=35 unexpected=162
INFO [04-20|22:28:05] Imported new block receipts count=396 elapsed=7.629s number=4875631 hash=a7a58f…dd5b50 size=55.68mB ignored=0

web3.eth.syncing
{
currentBlock: 4875631,
highestBlock: 5475934,
knownStates: 14747100,
pulledStates: 14692808,
startingBlock: 4787966
}
admin.peers.forEach(function(value){console.log(value.network.remoteAddress+"\t"+value.name)})
103.218.156.70:43104 Geth/v1.8.2-stable-b8b9f7f4/linux-amd64/go1.9.2
222.211.220.180:48448 Geth/v1.8.3-stable/darwin-amd64/go1.10
163.158.112.239:59764 Geth/v1.8.2-stable-b8b9f7f4/windows-amd64/go1.9.2
37.187.138.208:30303 Geth/v1.8.0-unstable-02aeb3d7/linux-amd64/go1.9.2
177.47.25.73:30303 Geth/v1.8.2-stable-b8b9f7f4/linux-amd64/go1.9.4
139.59.89.55:30303 Geth/v1.8.2-stable/linux-amd64/go1.8
188.242.205.9:36933 Geth/v1.8.4-stable-2423ae01/windows-amd64/go1.10.1
34.229.52.243:52098 Geth/v1.8.3-stable/linux-amd64/go1.10
207.140.28.134:64547 Geth/v1.8.2-stable-b8b9f7f4/linux-amd64/go1.9.4
190.2.131.22:30303 Geth/v1.8.1-stable-1e67410e/linux-amd64/go1.9.4
111.223.244.41:30303 Geth/v1.8.3-stable-329ac18e/linux-amd64/go1.10
98.236.39.57:64075 Geth/v1.8.2-stable-b8b9f7f4/windows-amd64/go1.9.2
47.74.13.140:56860 Geth/v1.7.3-stable-4bb3c89d/linux-amd64/go1.9
46.4.68.99:53836 Geth/v1.7.3-stable-4bb3c89d/linux-amd64/go1.9.2
94.23.209.80:30303 Geth/myethereumpool.com/v1.8.0-unstable-c6069a62/linux-amd64/go1.9.2
18.233.150.214:55530 Geth/v1.8.3-stable/linux-amd64/go1.10
13.57.23.192:58834 Geth/v1.7.3-stable/linux-amd64/go1.8.4
122.128.111.212:58598 Geth/v1.8.2-stable-b8b9f7f4/linux-amd64/go1.9.4
91.244.6.40:50797 Geth/v1.8.1-stable-1e67410e/darwin-amd64/go1.9.4
18.232.102.125:35860 Geth/v1.8.3-stable/linux-amd64/go1.10
184.82.140.49:51900 Geth/v1.8.2-stable-b8b9f7f4/linux-amd64/go1.9.4
54.250.244.254:58398 Geth/v1.7.3-stable-4bb3c89d/linux-amd64/go1.9.2
74.75.46.39:30303 Geth/v1.8.4-stable-2423ae01/windows-amd64/go1.10.1
62.76.96.6:30303 Geth/v1.6.7-stable/freebsd-amd64/go1.9
98.242.137.155:51510 Parity/v1.7.8-stable-d5fcf3b-20171025/x86_64-windows-msvc/rustc1.20.0
54.173.98.115:55624 Geth/v1.8.3-stable/linux-amd64/go1.10
undefined

DEBUG[04-20|22:36:53] Bad discv4 packet addr=66.90.137.190:30303 err="bad hash"
DEBUG[04-20|22:36:53] Recalculated downloader QoS values rtt=2.339245322s confidence=1.000 ttl=7.017742983s
DEBUG[04-20|22:36:53] Bad discv4 packet addr=37.187.142.200:30303 err="bad hash"
DEBUG[04-20|22:36:53] Bad discv4 packet addr=37.187.142.200:30303 err="bad hash"
DEBUG[04-20|22:36:53] Bad discv4 packet addr=37.139.18.135:46969 err="bad hash"
DEBUG[04-20|22:36:53] Bad discv4 packet addr=37.139.18.135:46969 err="bad hash"
DEBUG[04-20|22:36:53] Bad discv4 packet addr=46.166.165.70:30303 err="bad hash"
DEBUG[04-20|22:36:53] Bad discv4 packet addr=46.166.165.70:30303 err="bad hash"
DEBUG[04-20|22:36:54] Bad discv4 packet addr=13.93.211.84:30303 err="bad hash"
DEBUG[04-20|22:36:54] Bad discv4 packet addr=13.93.211.84:30303 err="bad hash"
DEBUG[04-20|22:36:54] Revalidated node b=15 id=2b74e7004f2f5db2
DEBUG[04-20|22:36:54] Bad discv4 packet addr=190.153.141.26:30303 err="bad hash"
DEBUG[04-20|22:36:54] Bad discv4 packet addr=161.202.29.153:30303 err="bad hash"
DEBUG[04-20|22:36:54] Bad discv4 packet addr=161.202.29.153:30303 err="bad hash"
DEBUG[04-20|22:36:54] Adding p2p peer name=Pirl/v1.8.1-stable-9... addr=37.187.24.200:60606 peers=27
DEBUG[04-20|22:36:54] Ethereum peer connected id=cf065474d8ab50df conn=dyndial name=Pirl/v1.8.1-stable-908baf27/linux-amd64/go1.10
DEBUG[04-20|22:36:54] Ethereum handshake failed id=cf065474d8ab50df conn=dyndial err="Genesis block mismatch - 29a742ba74d89fc2 (!= d4e56740f876aef8)"
DEBUG[04-20|22:36:54] Removing p2p peer id=cf065474d8ab50df conn=dyndial duration=13.648ms peers=26 req=true err="subprotocol error"
DEBUG[04-20|22:36:54] Bad discv4 packet addr=73.26.151.112:30303 err="bad hash"
DEBUG[04-20|22:36:55] Bad discv4 packet addr=47.156.184.37:30303 err="bad hash"
DEBUG[04-20|22:36:55] Adding p2p peer name=Geth/v1.8.2-stable-b... addr=159.65.167.139:30303 peers=27
DEBUG[04-20|22:36:55] Ethereum peer connected id=505c74d11a29bb8b conn=dyndial name=Geth/v1.8.2-stable-b8b9f7f4/linux-amd64/go1.9.4
DEBUG[04-20|22:36:55] Ethereum handshake failed id=505c74d11a29bb8b conn=dyndial err=EOF
DEBUG[04-20|22:36:55] Removing p2p peer id=505c74d11a29bb8b conn=dyndial duration=106.743ms peers=26 req=true err="subprotocol error"
DEBUG[04-20|22:36:55] Recalculated downloader QoS values rtt=2.339245322s confidence=1.000 ttl=7.017742983s
DEBUG[04-20|22:36:55] Bad discv4 packet addr=69.137.42.160:30303 err="bad hash"
DEBUG[04-20|22:36:55] Bad discv4 packet addr=69.137.42.160:30303 err="bad hash"
DEBUG[04-20|22:36:55] Bad discv4 packet addr=35.183.17.229:30303 err="bad hash"
DEBUG[04-20|22:36:55] Bad discv4 packet addr=35.183.17.229:30303 err="bad hash"
DEBUG[04-20|22:36:56] Adding p2p peer name=Geth/v1.8.4-stable/d... addr=218.102.216.95:30303 peers=27
DEBUG[04-20|22:36:56] Ethereum peer connected id=5ffd6db3b5704cd0 conn=dyndial name=Geth/v1.8.4-stable/darwin-amd64/go1.10.1
DEBUG[04-20|22:36:56] Bad discv4 packet addr=106.51.16.180:6796 err="bad hash"
DEBUG[04-20|22:36:56] Fetching batch of headers id=5ffd6db3b5704cd0 conn=dyndial count=1 fromnum=1920000 skip=0 reverse=false
DEBUG[04-20|22:36:56] Peer discarded announcement peer=aad2e4d0768efa04 number=5476107 hash=c31dd5…33b3af distance=5476107
DEBUG[04-20|22:36:56] Peer discarded announcement peer=290f087f29279aab number=5476107 hash=c31dd5…33b3af distance=5476107
DEBUG[04-20|22:36:56] Bad discv4 packet addr=73.26.151.112:30303 err="bad hash"
DEBUG[04-20|22:36:56] Verified to be on the same side of the DAO fork id=5ffd6db3b5704cd0 conn=dyndial
DEBUG[04-20|22:36:56] Peer discarded announcement peer=d880025a24fe8e41 number=5476107 hash=c31dd5…33b3af distance=5476107
DEBUG[04-20|22:36:56] Peer discarded announcement peer=143ce4eed974fadf number=5476107 hash=c31dd5…33b3af distance=5476107
DEBUG[04-20|22:36:57] Bad discv4 packet addr=46.39.167.51:30303 err="bad hash"
DEBUG[04-20|22:36:57] Bad discv4 packet addr=46.39.167.51:30303 err="bad hash"
DEBUG[04-20|22:36:57] Bad discv4 packet addr=78.171.68.225:30303 err="bad hash"
DEBUG[04-20|22:36:57] Bad discv4 packet addr=78.171.68.225:30303 err="bad hash"
DEBUG[04-20|22:36:57] Bad discv4 packet addr=40.118.3.223:30304 err="bad hash"
DEBUG[04-20|22:36:57] Bad discv4 packet addr=40.118.3.223:30304 err="bad hash"
DEBUG[04-20|22:36:57] Bad discv4 packet addr=24.85.99.142:30303 err="bad hash"
DEBUG[04-20|22:36:57] Bad discv4 packet addr=24.85.99.142:30303 err="bad hash"
DEBUG[04-20|22:36:57] Discarded propagated block, too far away peer=cc5bafc20bf77b82 number=5476107 hash=c31dd5…33b3af distance=5476107
DEBUG[04-20|22:36:57] Bad discv4 packet addr=171.214.245.237:27648 err="bad hash"
DEBUG[04-20|22:36:57] Bad discv4 packet addr=171.214.245.237:27648 err="bad hash"
DEBUG[04-20|22:36:57] Bad discv4 packet addr=84.165.151.174:30303 err="bad hash"
DEBUG[04-20|22:36:57] Bad discv4 packet addr=84.165.151.174:30303 err="bad hash"
DEBUG[04-20|22:36:58] Peer discarded announcement peer=f25d29b773b2c9d9 number=5476107 hash=c31dd5…33b3af distance=5476107
DEBUG[04-20|22:36:58] Peer discarded announcement peer=f25d29b773b2c9d9 number=5476106 hash=aae813…076358 distance=5476106
DEBUG[04-20|22:36:58] Discarded propagated block, too far away peer=fb8158e0b4ec5c75 number=5476107 hash=c31dd5…33b3af distance=5476107
DEBUG[04-20|22:36:58] Recalculated downloader QoS values rtt=2.339245322s confidence=1.000 ttl=7.017742983s
DEBUG[04-20|22:36:58] Discarded propagated block, too far away peer=143ce4eed974fadf number=5476107 hash=ab5707…0d3567 distance=5476107
DEBUG[04-20|22:36:58] Bad discv4 packet addr=37.139.18.135:46969 err="bad hash"
DEBUG[04-20|22:36:58] Bad discv4 packet addr=37.139.18.135:46969 err="bad hash"
DEBUG[04-20|22:36:58] Peer discarded announcement peer=bb5f974b9b29475a number=5476107 hash=c31dd5…33b3af distance=5476107
DEBUG[04-20|22:36:58] Bad packet from 54.86.144.203:30303: bad prefix
DEBUG[04-20|22:36:58] Bad discv4 packet addr=190.153.141.26:30303 err="bad hash"
DEBUG[04-20|22:36:58] Bad discv4 packet addr=47.156.184.37:30303 err="bad hash"
DEBUG[04-20|22:36:58] Bad discv4 packet addr=65.181.55.82:27382 err="bad hash"
DEBUG[04-20|22:36:58] Bad discv4 packet addr=159.65.141.137:30303 err="bad hash"
DEBUG[04-20|22:36:58] Bad discv4 packet addr=159.65.141.137:30303 err="bad hash"
DEBUG[04-20|22:36:58] Bad discv4 packet addr=109.203.112.175:30303 err="bad hash"
DEBUG[04-20|22:36:58] invalid neighbour (59.58.56.140) from [email protected]:30303: low port
DEBUG[04-20|22:36:58] Bad discv4 packet addr=109.203.112.175:30303 err="bad hash"
DEBUG[04-20|22:36:58] Peer discarded announcement peer=d880025a24fe8e41 number=5476107 hash=ab5707…0d3567 distance=5476107
DEBUG[04-20|22:36:59] Bad discv4 packet addr=84.160.185.4:30303 err="bad hash"
DEBUG[04-20|22:36:59] Discarded propagated block, too far away peer=1adfa9dc038e0ef3 number=5476107 hash=ab5707…0d3567 distance=5476107
DEBUG[04-20|22:36:59] Discarded propagated block, too far away peer=d880025a24fe8e41 number=5476108 hash=ed3e5f…d6336d distance=5476108
DEBUG[04-20|22:36:59] Peer discarded announcement peer=fb8158e0b4ec5c75 number=5476107 hash=ab5707…0d3567 distance=5476107
DEBUG[04-20|22:36:59] Peer discarded announcement peer=cc5bafc20bf77b82 number=5476107 hash=ab5707…0d3567 distance=5476107
DEBUG[04-20|22:36:59] Peer discarded announcement peer=d880025a24fe8e41 number=5476107 hash=b22055…6491f1 distance=5476107
DEBUG[04-20|22:36:59] Peer discarded announcement peer=0ac301696a4340e1 number=5476107 hash=c31dd5…33b3af distance=5476107
DEBUG[04-20|22:36:59] Peer discarded announcement peer=bb5f974b9b29475a number=5476107 hash=ab5707…0d3567 distance=5476107
DEBUG[04-20|22:37:00] Peer discarded announcement peer=143ce4eed974fadf number=5476107 hash=b22055…6491f1 distance=5476107
DEBUG[04-20|22:37:00] Peer discarded announcement peer=143ce4eed974fadf number=5476108 hash=2c83d2…46fd3b distance=5476108
DEBUG[04-20|22:37:00] Bad discv4 packet addr=217.210.201.99:30303 err="bad hash"
DEBUG[04-20|22:37:00] Replaced dead node b=8 id=087bb67fc15d3d28 ip=118.193.191.101 r=708e8597c380ea98 rip=35.197.91.177
DEBUG[04-20|22:37:00] Peer discarded announcement peer=d880025a24fe8e41 number=5476108 hash=2c83d2…46fd3b distance=5476108
DEBUG[04-20|22:37:00] Bad discv4 packet addr=35.182.255.28:30303 err="bad hash"
DEBUG[04-20|22:37:00] Recalculated downloader QoS values rtt=2.339245322s confidence=1.000 ttl=7.017742983s
DEBUG[04-20|22:37:00] Peer discarded announcement peer=290f087f29279aab number=5476107 hash=b22055…6491f1 distance=5476107
DEBUG[04-20|22:37:00] Peer discarded announcement peer=fb8158e0b4ec5c75 number=5476107 hash=b22055…6491f1 distance=5476107
DEBUG[04-20|22:37:00] Bad discv4 packet addr=46.166.165.70:30303 err="bad hash"
DEBUG[04-20|22:37:00] Bad discv4 packet addr=46.166.165.70:30303 err="bad hash"
DEBUG[04-20|22:37:00] Peer discarded announcement peer=143ce4eed974fadf number=5476108 hash=ed3e5f…d6336d distance=5476108
DEBUG[04-20|22:37:00] Bad packet from 34.228.62.228:30303: bad prefix
DEBUG[04-20|22:37:01] Discarded propagated block, too far away peer=bb5f974b9b29475a number=5476107 hash=b22055…6491f1 distance=5476107
DEBUG[04-20|22:37:01] Peer discarded announcement peer=fb8158e0b4ec5c75 number=5476108 hash=ed3e5f…d6336d distance=5476108
DEBUG[04-20|22:37:01] Bad discv4 packet addr=52.43.69.3:30303 err="bad hash"
DEBUG[04-20|22:37:01] Peer discarded announcement peer=bb5f974b9b29475a number=5476108 hash=ed3e5f…d6336d distance=5476108
DEBUG[04-20|22:37:01] Bad discv4 packet addr=217.210.201.99:30303 err="bad hash"
DEBUG[04-20|22:37:02] Bad discv4 packet addr=206.189.6.48:30304 err="bad hash"
DEBUG[04-20|22:37:02] Peer discarded announcement peer=bb5f974b9b29475a number=5476108 hash=2c83d2…46fd3b distance=5476108
DEBUG[04-20|22:37:02] Bad discv4 packet addr=52.212.134.10:30303 err="bad hash"
DEBUG[04-20|22:37:02] Bad discv4 packet addr=52.212.134.10:30303 err="bad hash"
DEBUG[04-20|22:37:02] Bad discv4 packet addr=65.181.55.82:27382 err="bad hash"
DEBUG[04-20|22:37:02] Bad discv4 packet addr=85.248.228.251:30303 err="bad hash"
DEBUG[04-20|22:37:02] Bad discv4 packet addr=85.248.228.251:30303 err="bad hash"
DEBUG[04-20|22:37:02] Recalculated downloader QoS values rtt=2.339245322s confidence=1.000 ttl=7.017742983s
DEBUG[04-20|22:37:02] Bad discv4 packet addr=82.202.236.216:30303 err="bad hash"
DEBUG[04-20|22:37:02] Peer discarded announcement peer=f25d29b773b2c9d9 number=5476108 hash=2c83d2…46fd3b distance=5476108
DEBUG[04-20|22:37:02] Peer discarded announcement peer=f25d29b773b2c9d9 number=5476107 hash=ab5707…0d3567 distance=5476107
DEBUG[04-20|22:37:02] Peer discarded announcement peer=f25d29b773b2c9d9 number=5476107 hash=b22055…6491f1 distance=5476107
DEBUG[04-20|22:37:02] Discarded propagated block, too far away peer=f25d29b773b2c9d9 number=5476108 hash=2c83d2…46fd3b distance=5476108
DEBUG[04-20|22:37:02] Bad discv4 packet addr=192.169.153.139:30303 err="bad hash"
DEBUG[04-20|22:37:02] Bad discv4 packet addr=192.169.153.139:30303 err="bad hash"
DEBUG[04-20|22:37:02] invalid neighbour (93.120.235.221) from [email protected]:30303: low port
DEBUG[04-20|22:37:02] Peer discarded announcement peer=290f087f29279aab number=5476106 hash=aae813…076358 distance=5476106
DEBUG[04-20|22:37:02] Bad discv4 packet addr=177.45.243.237:30303 err="bad hash"
DEBUG[04-20|22:37:02] invalid neighbour (36.80.25.41) from [email protected]:30303: low port
DEBUG[04-20|22:37:02] Bad discv4 packet addr=177.45.243.237:30303 err="bad hash"
DEBUG[04-20|22:37:03] Peer discarded announcement peer=2e14aa28f29d73b6 number=5476107 hash=c31dd5…33b3af distance=5476107
DEBUG[04-20|22:37:03] Bad discv4 packet addr=95.165.144.143:30303 err="bad hash"
DEBUG[04-20|22:37:03] Bad discv4 packet addr=61.100.188.81:30303 err="bad hash"
DEBUG[04-20|22:37:03] Bad discv4 packet addr=61.100.188.81:30303 err="bad hash"
DEBUG[04-20|22:37:03] Bad discv4 packet addr=165.227.84.0:30303 err="bad hash"
DEBUG[04-20|22:37:03] Bad discv4 packet addr=47.156.184.37:30303 err="bad hash"

I'm seeing a similar issue, deadlocking during state sync. At first I thought it was due to using an HDD for storage, which many have warned about; but even after switching to an SSD I still get the same deadlock.

There may be more than one issue here: sometimes, even after I try to kill geth, it starts to shut down but never finishes, while other times it is able to shut itself down.

I recently saved a backtrace from one such attempt: after it received SIGINT and started shutting down, I sent it 9 more SIGINTs to force a panic and backtrace:

Backtrace

[Update: removed "suspicious" suggestion which turned out to be unrelated and probably not an issue.]

I was going to open an issue as well, but mine is incredibly similar to this one. I was hoping the stalls were confined to the sync process, but alas, no: the issue continues after sync.

System information

Geth version: v1.8.4-stable-2423ae01/go1.10 (installed via Ubuntu PPA package)
OS & Version: Ubuntu 16.04.4 LTS
Hardware: AWS t.medium, 4 GB RAM, 2 cores

Expected

  1. Fast sync in 3-4 hours.
  2. Continuous operation after sync.

Actual

  1. The sync stalled repeatedly and required systemd restarts to get geth communicating with the network again. This, of course, disrupted my sync, which ended up taking 4 days.

  2. The system stalls again. At verbosity 4, I see the syslog output come to a stop. At verbosity 5, I still see transaction-validation activity; however, the latest block number stays fixed and state freezes, as verified by checking eth.blockNumber.

  3. The stall happens more quickly if I use --cache=1024, although I no longer see the Out of Memory errors that v1.7 exhibited.

(Screenshot attached, taken 2018-04-20 at 7:18:24 PM.)

I notice that a DHCPREQUEST follows the interruption by 1 to 2 minutes, but I doubt this is relevant.

Steps to Reproduce Behavior

As the user stated above, possibly related to networking conditions on the machine. Happens anywhere between 2 minutes and 1 hour after starting the node.

Thoughts

I think this may involve a QoS issue. I notice that the nodes I end up networking with are not providing useful information, and I see continuous peer drops in the log output.

Does AWS throttle P2P connections? Could there be a problem with running Geth on AWS? Doubtful...

I need to roll out a production app very soon and need a stable version of geth. Systemd is useless in dealing with stalls, as there is no detectable failure for it to trigger a restart on. I would like to know if anyone recommends a specific stable version, v1.8.2 or v1.7 perhaps? Many thanks.

After examining this more closely, I think I see what's causing the deadlock. It's due to a bug in this commit on April 16, 2018:

eth/downloader: wait for all fetcher goroutines to exit before termin…

The problem is that wgCancel.Done() is only called when each of the fetchers returns. But some of the fetchers (for example, the header fetcher as well as the state fetcher) can call dropPeer(). Inside dropPeer() (a.k.a. removePeer()), if the peer being dropped is the master peer, it calls d.Cancel(), which waits on wgCancel before it can return. But it will never finish waiting, since it is effectively waiting on itself to return.
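
To make the circular dependency concrete, here is a minimal sketch of the pattern (hypothetical names; not the actual downloader code). A fetcher goroutine is tracked by a WaitGroup; dropping the master peer enters the cancel path, which waits on that same WaitGroup, so the goroutine ends up waiting for its own exit. Running the sketch trips Go's runtime deadlock detector.

package main

import "sync"

type downloader struct {
	cancelWg sync.WaitGroup // each fetcher goroutine registers here
}

// cancel waits for every registered fetcher goroutine to exit.
func (d *downloader) cancel() {
	d.cancelWg.Wait()
}

// dropPeer mimics removePeer(): dropping the master peer cancels the sync.
func (d *downloader) dropPeer(isMaster bool) {
	if isMaster {
		d.cancel() // blocks forever when called from inside a fetcher
	}
}

// fetchHeaders mimics a fetcher goroutine that decides to drop its own peer.
func (d *downloader) fetchHeaders() {
	defer d.cancelWg.Done() // never runs, because dropPeer never returns
	d.dropPeer(true)
}

func main() {
	d := new(downloader)
	d.cancelWg.Add(1)
	go d.fetchHeaders()
	d.cancelWg.Wait() // main blocks too; Go reports "all goroutines are asleep - deadlock!"
}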

Thanks @reductionista for investigating! @fjl @karalabe ptal

I need to roll out a production app very soon and need a stable version of geth. (...) I would like to know if anyone recommends a specific stable version (...)?

@GeeeCoin If @reductionista has correctly identified the cause, rolling back one MINOR release (to v1.8.3) should work. I've rolled back a different server to v1.8.0 and it seems stable; but that has already fully synced previously.

Switched to v1.8.3 and the issue doesn't occur.

@reductionista Very nice catch, thank you!

Should be fixed on master, pushed out in 1.8.5.

Hey @reductionista, do you happen to have a public ENS name or ETH address?

I'd like to give a small tip.


reductionista.eth, I'd guess?..

@veox Wow, that's awfully sweet of you. Yes, you can send it to reductionista.eth. Thanks!
