I'm not sure if this is due to the way that I start/stop Geth, or a bug, or just the way it is...
I run geth on Ubuntu under supervisor, a process control system similar to systemd.
I was on v1.8.0 and ran apt-get to update to v1.8.1, but I had to restart the service, so I ran supervisorctl restart geth.
That resulted in my fully sync'd node dropping 676 blocks. That seems like a lot of blocks to drop 😕
Lightly redacted log:
INFO [02-19|09:13:53] Imported new chain segment blocks=1 txs=216 mgas=7.993 elapsed=450.119ms mgasps=17.758 number=5119264 hash=01f218…7b9e13 cache=200.80mB
INFO [02-19|09:14:41] Imported new chain segment blocks=1 txs=163 mgas=7.993 elapsed=315.894ms mgasps=25.304 number=5119265 hash=38f2fd…113901 cache=201.31mB
INFO [02-19|09:14:53] Imported new chain segment blocks=1 txs=291 mgas=7.985 elapsed=815.572ms mgasps=9.790 number=5119266 hash=9f243a…9d4d68 cache=201.77mB
INFO [02-19|09:15:25] Maximum peer count ETH=10 LES=40 total=50
INFO [02-19|09:15:25] Starting peer-to-peer node instance=Geth/MysticRyuujin-GETH01/v1.8.1-stable-1e67410e/linux-amd64/go1.9.4
INFO [02-19|09:15:25] Allocated cache and file handles database=/geth/.ethereum/geth/chaindata cache=768 handles=512
INFO [02-19|09:15:27] Initialised chain configuration config="{ChainID: 1 Homestead: 1150000 DAO: 1920000 DAOSupport: true EIP150: 2463000 EIP155: 2675000 EIP158: 2675000 Byzantium: 4370000 Engine: ethash}"
INFO [02-19|09:15:27] Disk storage enabled for ethash caches dir=/geth/.ethereum/geth/ethash count=3
INFO [02-19|09:15:27] Disk storage enabled for ethash DAGs dir=/home/geth/.ethash count=2
INFO [02-19|09:15:27] Initialising Ethereum protocol versions="[63 62]" network=1
WARN [02-19|09:15:27] Head state missing, repairing chain number=5119266 hash=9f243a…9d4d68
INFO [02-19|09:15:28] Rewound blockchain to past state number=5118586 hash=1a86e5…7b6dd6
INFO [02-19|09:15:28] Loaded most recent local header number=5119266 hash=9f243a…9d4d68 td=2624927619703869778791
INFO [02-19|09:15:28] Loaded most recent local full block number=5118586 hash=1a86e5…7b6dd6 td=2622904571753520301193
INFO [02-19|09:15:28] Loaded most recent local fast block number=5119266 hash=9f243a…9d4d68 td=2624927619703869778791
WARN [02-19|09:15:28] Blockchain not empty, fast sync disabled
INFO [02-19|09:15:28] Loaded CHT section=155 head=122be1…21f3c7 root=f687e5…a872cf
INFO [02-19|09:15:28] Loaded bloom trie section=155 head=122be1…21f3c7 root=5354f8…cb62a3
INFO [02-19|09:15:28] Starting P2P networking
INFO [02-19|09:15:28] Mapped network port proto=udp extport=30301 intport=30301 interface=ExtIP(1.1.1.1)
INFO [02-19|09:15:28] UDP listener up self=enode://5bcfb28162bf02400c228f2383447e079e75b089cea49b8f718bd06ef0cf27c61233e856311d3811148d424eabccfa0bff9393e939b3624de44910c70f204c9c@1.1.1.1:30301
INFO [02-19|09:15:28] UDP listener up net=enode://5bcfb28162bf02400c228f2383447e079e75b089cea49b8f718bd06ef0cf27c61233e856311d3811148d424eabccfa0bff9393e939b3624de44910c70f204c9c@[::]:30301
INFO [02-19|09:15:28] RLPx listener up self=enode://5bcfb28162bf02400c228f2383447e079e75b089cea49b8f718bd06ef0cf27c61233e856311d3811148d424eabccfa0bff9393e939b3624de44910c70f204c9c@1.1.1.1:30301
INFO [02-19|09:15:28] Stats daemon started
INFO [02-19|09:15:28] Mapped network port proto=tcp extport=30301 intport=30301 interface=ExtIP(1.1.1.1)
INFO [02-19|09:15:28] Starting topic registration topic=LES2@d4e56740f876aef8
INFO [02-19|09:15:28] IPC endpoint opened url=/geth/.ethereum/geth.ipc
INFO [02-19|09:15:28] HTTP endpoint opened url=http://localhost:8545 cors=* vhosts=geth01
INFO [02-19|09:15:38] Block synchronisation started
INFO [02-19|09:15:46] Imported new chain segment blocks=4 txs=441 mgas=21.577 elapsed=4.013s mgasps=5.376 number=5118590 hash=83c758…59a479 cache=2.12mB
INFO [02-19|09:16:00] Imported new chain segment blocks=9 txs=1545 mgas=58.868 elapsed=8.339s mgasps=7.059 number=5118599 hash=344a39…6167c8 cache=8.95mB
According to @karalabe's tweet:
If you terminate Geth gracefully, it will write out the state of the HEAD-128th block, so the next restart will reprocess that many blocks. If you pull the rug from underneath Geth, you get whatever was last written to disk (5 mins of active processing).
I'm unclear whether the loss of 676 blocks is in line with the first sentence. According to the supervisor documentation, the default kill signal is SIGTERM, which should be the graceful exit path? I'll note that it took me about 15 minutes to re-sync.
https://ethereum.stackexchange.com/a/712/4757 suggests:
kill -INT `pidof geth`
Tried here - seems to shut down gracefully.
P.S. When using systemd, the default TimeoutStopSec of 90 seconds might not be enough - at least when interrupting a node that hasn't yet caught up after a previous termination.
(JIC, here's my service file.)
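For illustration only (this is not the attached file; the user, paths, and timeout below are assumptions), a unit file along these lines sends SIGINT and raises the stop timeout:

[Unit]
Description=Geth Ethereum node
After=network.target

[Service]
# User and paths are placeholders - adjust to your installation.
User=geth
ExecStart=/usr/bin/geth --datadir /geth/.ethereum
# Pre-fix geth only shuts down cleanly on SIGINT, not the default SIGTERM.
KillSignal=SIGINT
# Allow plenty of time for state to be flushed to disk before SIGKILL.
TimeoutStopSec=300
Restart=on-failure

[Install]
WantedBy=multi-user.target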
So I think @veox is correct. Geth seems to expect either SIGINT or the exit command from the console; only then does it shut down gracefully. It seems that Geth doesn't handle SIGTERM, which I think it should, since that's the standard "please shut down" signal on *nix...
When I altered supervisor to send SIGINT instead, the logs reflected a graceful shutdown.
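For anyone else running geth under supervisor, a sketch of the relevant program section (command, user, and paths are placeholders) - stopsignal is the important line:

[program:geth]
; Command, user, and paths are placeholders - adjust to your installation.
command=/usr/bin/geth --datadir /geth/.ethereum
user=geth
autorestart=true
; Send SIGINT instead of the default SIGTERM so geth shuts down cleanly.
stopsignal=INT
; Give geth time to flush state before supervisor escalates to SIGKILL.
stopwaitsecs=300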
According to most documentation, SIGTERM:
It is the normal way to politely ask a program to terminate
You will still lose up to 128 blocks with a graceful termination (with default settings)
Hmmm, ok, so we have two issues here. One is to make geth play nice with SIGTERM; I think that's a reasonable request, will do it. The other issue is that geth starts up 128 blocks behind. This was meant as a way to quickly pull in 128 blocks into memory on startup, but I think there's a lot of confusion stemming from it. Maybe we should just write out the last state on shutdown instead.
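For context, handling both signals in Go is a small change wherever the interrupt handler is registered; a standalone sketch (not geth's actual code) looks roughly like this:

package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// Listen for both SIGINT (Ctrl-C, kill -INT) and SIGTERM (the default for
	// supervisor, systemd and plain kill), so either triggers a clean shutdown.
	sigc := make(chan os.Signal, 1)
	signal.Notify(sigc, syscall.SIGINT, syscall.SIGTERM)

	sig := <-sigc
	fmt.Printf("Got %v, shutting down gracefully...\n", sig)
	// ... flush state, close the database, stop networking, then exit.
}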
This should be fixed on master.
Tested with 1.8.2-unstable-4c845bdc. It does indeed terminate gracefully, and picks up where it left off upon restart. Haven't yet encountered an ommer situation, where reverting to an earlier trie was required (but I've only tested a few times).
geth does seem to be busy with something else after closing the database, resulting in systemd force-terminating the process - but that's a different issue, and needs some more testing anyway.
Thanks for implementing this!