Hoping I can get some help with this issue:
Geth
Version: 1.9.13-stable
Architecture: amd64
Protocol Versions: [65 64 63]
Go Version: go1.13.4
Operating System: linux
Running a private network with 2 nodes on AWS servers - one region went down, and when I try to bring the node back up I get this error:
Error: invalid mix digest
WARN [01-02|12:32:27.063] Synchronisation failed, dropping peer peer=8538963b9b1151ef err="retrieved hash chain is invalid"
The other node is working fine, creating and verifying contracts - but I really need more nodes running and synchronizing.
I tried to export and import the chaindata from the running node, but as soon as I start the second node I get "invalid chain data".
Is there any way to recover from this?
Please provide more information. E.g. what do you mean by "one region went down" - did geth have an unclean shutdown?
Yes there was an unclean shutdown - the EC2 instance terminated.
Unfortunately you will most likely need to resync then
I am happy to resync, but when I do, it stops with the "invalid chain data" message. I tried full, fast and light sync, plus export/import - the second node just does not want to sync.
So you deleted the data-dir and it still does not want to sync?
Yes, correct - completely restarted the second node from geth init genesis.json. It was syncing fine and then hit this error.
the other node (the only one) is working fine - mining and verifying
did you delete the data dir?
Seems something is wrong with your cache - I think init does not delete it. Also, do you use --ethash.cachedir or --ethash.dagdir?
Ideally specify all your CLI arguments
Yes, I deleted the data dir. Full CLI:
nohup geth --datadir ./datadir --keystore ./keystore --networkid=88888001 --nodiscover --syncmode=full --maxpeers 3 --nousb --verbosity 3 --cache=2048 --ethash.dagdir=./.ethash &
can you remove --ethash.dagdir=./.ethash and see if it works then?
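E.g. the same command minus that flag (Geth should then fall back to its default DAG directory, ~/.ethash on Linux):
nohup geth --datadir ./datadir --keystore ./keystore --networkid=88888001 --nodiscover --syncmode=full --maxpeers 3 --nousb --verbosity 3 --cache=2048 &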
Is it safe to delete the dagdir? I only have one node running now, with production smart contracts - I cannot afford to corrupt it.
@atrana make a test environment for your node
actually just did it now - removed the dagdir and restarted... looking ok so far
ok cool. Best of luck! @atrana
It started and regenerated the DAGs fine, and individually the node is working, but as soon as I add a peer and start syncing I get this error:
INFO [01-04|15:00:02.345] Looking for peers peercount=0 tried=1 static=1
ERROR[01-04|15:00:02.549]
Chain config: {ChainID: 88888001 Homestead: 0 DAO:
Number: 12122670
Hash: 0x5a29e4da0cfce27ac3eca41203c7d854eb3c4c47f73279dc6f96cef68a7ac63d
Error: invalid mix digest
WARN [01-04|15:00:02.567] Synchronisation failed, dropping peer peer=8538963b9b1151ef err="retrieved hash chain is invalid"
I would recommend doing a disk-check. I suspect that the DAG data is corrupt, and since it got corrupt again it seems likely that it's due to a disk issue.
We could also verify this: if you post a shasum of the files in the dagdir / cachedir, we can compare against the correct versions.
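Something like this - the file names below are placeholders, the real ones encode the ethash revision and seed hash:
shasum ./.ethash/full-R23-*                # DAG files
shasum ./datadir/geth/ethash/cache-R23-*   # cache files, assuming the default cache dir under the datadir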
Oh wait -- what engine are you using? I can't believe you ran ethash for 12122670 blocks?
Also, 1.9.13 suffers from a bug in mining ethash: https://github.com/ethereum/go-ethereum/security/advisories/GHSA-v592-xf75-856p . If you're at block 12M, you would definitely hit it.
We currently run 2 nodes on a private network - one of them is also constantly mining. Is there another way to ensure transactions and smart contracts are processed on demand?
Normally people use clique proof-of-authority networks, instead of ethash proof-of-work mining.
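A clique genesis looks roughly like this - chainId taken from your setup, the signer address in extraData is a placeholder:
{
  "config": {
    "chainId": 88888001,
    "clique": {
      "period": 5,
      "epoch": 30000
    }
  },
  "difficulty": "1",
  "gasLimit": "8000000",
  "extraData": "0x<64 zeros><signer address, 40 hex chars><130 zeros>",
  "alloc": {}
}
The extraData packs 32 vanity bytes, then the list of initial signer addresses, then 65 bytes reserved for the seal.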
It may well be that the mining node hit the 4gb threshold quite some while ago, and has mined "bad blocks" for a while now.. Which means that you're in a pretty bad situation, basically. Either you need to keep the mining bug in there, or throw away a few million blocks...
I think this error has been around since 23 Dec. I can see in the logs that the peer was dropped back then. We didn't pick up on the problem until 1st Jan (peeps being away on holidays)
I am happy to rewind back a few million blocks - how do I do that?
I have just been reading more on PoA networks and I think this suits us better, as our network has no value but requires proof of transactions. I think when we started this nearly 2 years ago, clique was not available in geth (or maybe we overlooked it). Is there any way to migrate from PoW to PoA?
No, there's no way to migrate, you'd have to start over from scratch. That said, it's possible to populate the genesis alloc with arbitrary state: balances, code and storage.
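E.g. something like this in the new genesis.json - the address, code and storage values are just placeholders:
"alloc": {
  "0x0123456789012345678901234567890123456789": {
    "balance": "0xde0b6b3a7640000",
    "code": "0x6080604052...",
    "storage": {
      "0x0000000000000000000000000000000000000000000000000000000000000000": "0x0000000000000000000000000000000000000000000000000000000000000001"
    }
  }
}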
Sounds like you'd have to rewrite your code for that.
We might bring up a clique network for all new customers (we are a small fintech startup) and slowly migrate existing customers over. But for now, coming back to the question above - how do I throw away the "bad blocks" so that I can upgrade to v1.9.24 and start mining and syncing again?
A full-sync will verify every header, and get stuck on the first bad one. A fast-sync will not verify the PoW on every header, so it might "land" somewhere after a bad block.
So first of all, you need to figure out which the first bad block is. After that, you can do a setHead to that block, or full-sync to that point and then start mining
You could also, with some custom code, ensure that every header's PoW is verified during fast-sync, by changing fsHeaderCheckFrequency to 1 in downloader.go. Then a fast-sync would pinpoint the first bad block.
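If I remember the 1.9.x tree right, it's this constant in eth/downloader/downloader.go:
fsHeaderCheckFrequency = 100 // Verification frequency of the downloaded headers during fast sync
Change the 100 to 1 and the downloader verifies the PoW of every header instead of every 100th.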
Thanks @holiman - so from the console I would use debug.setHead('#blocknum')?
Yes. I can't guarantee it will work perfectly, but that's the way to do it. It's obviously not something that is considered part of the normal use case -- it's a last-ditch approach to correcting a bad error, but @karalabe put a lot of effort into making setHead behave correctly.
It probably requires a restart after the operation is finished.
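From the attached console it would look something like this - the block number goes in as a hex string, 0xB71B00 (12,000,000) is just an example:
> debug.setHead("0xB71B00")
null
> eth.blockNumber
12000000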
@atrana Regarding the invalid mix digest, I'm almost sure it's because of the 4GB DAG issue. You used the old Geth to generate the DAG, which becomes invalid once the DAG size exceeds 4GB.
You can use the latest released Geth to regenerate the DAG. Then, with the correct DAG, apply the approach from @holiman to wipe all the mined bad blocks.
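If you don't want to wait for the generation at node startup, geth also has a makedag subcommand (going from memory on the exact usage, check geth makedag --help) - block number and output dir here are just your values as an example:
geth makedag 12122670 ./.ethash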
Thanks for the advice, guys. These are the steps I am planning to take over the weekend:
1. Upgrade both nodes to the latest Geth release.
2. Delete the old DAG dir and regenerate the DAG with the new version.
3. Find the first bad block via a full sync on the second node.
4. debug.setHead to the block before it on the mining node, then restart.
5. Resync the second node and start mining again.
Does that sound right?