Go-ethereum: Can't get past pivot during fast sync

Created on 11 Dec 2017 · 20 comments · Source: ethereum/go-ethereum

System information

Geth version: 1.7.3 / 1.8.0-dev
OS & Version: Windows 10

Expected behaviour

Fast sync will download the pivot block and switch to full mode.

Actual behaviour

Fast sync keeps downloading state entries forever, for days even; it never downloads the first full block. Bandwidth usage is close to 0%, and so is HDD usage.

Steps to reproduce the behaviour

Start fast sync from scratch and wait until pivot is reached.

Notes

Using light mode is not an alternative.

The fast sync pivot is max_block - (64 + rand(0,256)), or something like that. That matches all the reports saying fast sync gets stuck close to the end; it's always within that range.
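
For reference, a minimal standalone sketch of that calculation as I read it in the 1.7.x downloader. The constant names fsMinFullBlocks and fsPivotInterval come from my reading of eth/downloader; the exact formula may differ between versions, so treat this as an illustration rather than the upstream code:

package main

import (
	"fmt"
	"math/rand"
)

const (
	fsMinFullBlocks = 64  // blocks at the head that are always executed in full
	fsPivotInterval = 256 // window out of which the pivot offset is randomized
)

// pivotFor sketches the "max_block - (64 + rand(0,256))" rule described above.
func pivotFor(head uint64) uint64 {
	offset := uint64(fsMinFullBlocks + rand.Intn(fsPivotInterval))
	if head <= offset {
		return 0
	}
	return head - offset
}

func main() {
	// For a head of 4,710,657 the pivot lands somewhere in [4710338, 4710593].
	fmt.Println(pivotFor(4710657))
}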

I have little to no knowledge of eth. My comments are based on my review of the geth codebase.

I do not know if this is the actual cause of the problem or just a symptom of an underlying one.

I've noticed that fast sync WORKS for some people, according to Google, but it also fails in a way very similar to this for a lot of other people.

This is similar to, or possibly the same as, issue #15001; I have already checked all the comments there.

I have already tried --cache values of 1024, 2048, 4096, and a lot more. I have 32 GB of RAM.

It's not an HDD problem: I/O usage is low. Maybe it will become one once geth starts trying to keep up with full blocks after the initial sync, but not now.
It's not a time sync problem: I'm using NTP.
It's not a bandwidth problem: bandwidth usage is low.

I have deleted the whole blockchain 3 times already to debug this.

I have manually confirmed (10+ times) that it always gets stuck on pivot - 1. On geth restart, the pivot is updated and block states are imported until it reaches pivot - 1 again.

Fast mode is automatically re-enabled even if you restart geth. Modifying the code to override that causes full sync to start from block 0 (which makes sense, since full mode starts from the last full block saved in the db).
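
That behaviour matches the mode-selection check I found while reading eth/handler.go: fast sync is only force-disabled once the local chain has at least one full block. A simplified, standalone sketch of that decision (names and structure are mine, not the upstream code):

package main

import "fmt"

type syncMode int

const (
	fullSync syncMode = iota
	fastSync
)

// chooseSyncMode keeps fast sync enabled until at least one full block has
// been committed locally; after that, fast sync is disabled for safety.
func chooseSyncMode(requested syncMode, currentFullBlock uint64) syncMode {
	if requested == fastSync && currentFullBlock > 0 {
		fmt.Println("blockchain not empty, fast sync disabled")
		return fullSync
	}
	return requested
}

func main() {
	fmt.Println(chooseSyncMode(fastSync, 0) == fastSync)       // true: still in fast mode after a restart
	fmt.Println(chooseSyncMode(fastSync, 4813346) == fastSync) // false: one full block exists, full sync from here on
}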

Possibly unrelated, but here is a chunk of the debug log during the endless state download. As you can see, the distance is too big because the current full block is 0. This is probably just a misplaced debug message: the current full block will always be 0 during fast sync, so using the distance to drop propagated blocks during fast sync doesn't make much sense (a sketch of this distance check follows the log excerpt below).

DEBUG[12-10|19:25:51] Removing p2p peer                        id=1dcea46f98f2194b conn=inbound duration=101.269ms  peers=20 req=true  err="too many peers"
DEBUG[12-10|19:25:51] Discarded propagated block, too far away peer=496895dba7598b47 number=4710657 hash=b728cd…63a9f8 distance=4710657
DEBUG[12-10|19:25:51] Discarded propagated block, too far away peer=496895dba7598b47 number=4710658 hash=583749…79d42b distance=4710658
DEBUG[12-10|19:25:51] Peer discarded announcement              peer=938e28ff38267daf number=4710658 hash=583749…79d42b distance=4710658
DEBUG[12-10|19:25:51] Peer discarded announcement              peer=f6717be95b7eb6f4 number=4710658 hash=583749…79d42b distance=4710658
DEBUG[12-10|19:25:51] Peer discarded announcement              peer=b50e04afad996390 number=4710658 hash=583749…79d42b distance=4710658
DEBUG[12-10|19:25:51] Peer discarded announcement              peer=c20da4a220a18f87 number=4710658 hash=583749…79d42b distance=4710658
DEBUG[12-10|19:25:51] Peer discarded announcement              peer=0c74e66a402212ab number=4710658 hash=583749…79d42b distance=4710658
DEBUG[12-10|19:25:51] Peer discarded announcement              peer=4a919e915c88cad0 number=4710657 hash=4c07d2…27851c distance=4710657
DEBUG[12-10|19:25:51] Discarded propagated block, too far away peer=4a919e915c88cad0 number=4710657 hash=4c07d2…27851c distance=4710657
DEBUG[12-10|19:25:51] Peer discarded announcement              peer=f87ba5f35e371fed number=4710658 hash=583749…79d42b distance=4710658
DEBUG[12-10|19:25:51] Peer discarded announcement              peer=58e4ed905b0c1680 number=4710658 hash=583749…79d42b distance=4710658
DEBUG[12-10|19:25:51] Peer discarded announcement              peer=a7cb3613a511703a number=4710658 hash=583749…79d42b distance=4710658
DEBUG[12-10|19:25:51] Peer discarded announcement              peer=496895dba7598b47 number=4710659 hash=e467e6…d7e33e distance=4710659
DEBUG[12-10|19:25:51] Peer discarded announcement              peer=938e28ff38267daf number=4710657 hash=b728cd…63a9f8 distance=4710657
DEBUG[12-10|19:25:51] Peer discarded announcement              peer=f6717be95b7eb6f4 number=4710657 hash=b728cd…63a9f8 distance=4710657
DEBUG[12-10|19:25:51] Peer discarded announcement              peer=b50e04afad996390 number=4710657 hash=b728cd…63a9f8 distance=4710657
DEBUG[12-10|19:25:51] Peer discarded announcement              peer=c20da4a220a18f87 number=4710657 hash=b728cd…63a9f8 distance=4710657
DEBUG[12-10|19:25:51] Peer discarded announcement              peer=0c74e66a402212ab number=4710657 hash=b728cd…63a9f8 distance=4710657
DEBUG[12-10|19:25:51] Peer discarded announcement              peer=4a919e915c88cad0 number=4710659 hash=e467e6…d7e33e distance=4710659
DEBUG[12-10|19:25:51] Peer discarded announcement              peer=f87ba5f35e371fed number=4710659 hash=e467e6…d7e33e distance=4710659
DEBUG[12-10|19:25:51] Peer discarded announcement              peer=58e4ed905b0c1680 number=4710657 hash=b728cd…63a9f8 distance=4710657
DEBUG[12-10|19:25:51] Peer discarded announcement              peer=a7cb3613a511703a number=4710657 hash=b728cd…63a9f8 distance=4710657
DEBUG[12-10|19:25:51] Discarded propagated block, too far away peer=58e4ed905b0c1680 number=4710658 hash=583749…79d42b distance=4710658
DEBUG[12-10|19:25:51] Peer discarded announcement              peer=938e28ff38267daf number=4710659 hash=e467e6…d7e33e distance=4710659
DEBUG[12-10|19:25:51] Peer discarded announcement              peer=f6717be95b7eb6f4 number=4710659 hash=e467e6…d7e33e distance=4710659
DEBUG[12-10|19:25:51] Peer discarded announcement              peer=b50e04afad996390 number=4710659 hash=e467e6…d7e33e distance=4710659
DEBUG[12-10|19:25:51] Peer discarded announcement              peer=c20da4a220a18f87 number=4710659 hash=e467e6…d7e33e distance=4710659
DEBUG[12-10|19:25:51] Peer discarded announcement              peer=0c74e66a402212ab number=4710659 hash=e467e6…d7e33e distance=4710659
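
For reference, here is a standalone sketch of the distance check behind those "too far away" lines, as I understand it from eth/fetcher; the constant values are assumptions from memory and not verified against the source:

package main

import "fmt"

const (
	maxUncleDist = 7  // how far behind the head a propagated block may be
	maxQueueDist = 32 // how far ahead of the head a propagated block may be
)

// tooFarAway mirrors the distance test behind the "too far away" log lines:
// the announced number is compared against the current full-chain head, which
// is still 0 during fast sync, so every propagated block gets discarded.
func tooFarAway(announced, head uint64) bool {
	dist := int64(announced) - int64(head)
	return dist < -maxUncleDist || dist > maxQueueDist
}

func main() {
	fmt.Println(tooFarAway(4710657, 0))       // true, distance 4710657 as in the log above
	fmt.Println(tooFarAway(4710657, 4710650)) // false, within the queue window
}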

All 20 comments

I think I am seeing exactly the same thing. Geth 1.7.2 on OSX never catches up but gets stuck in the same "pivot" range you mention. I have seen it stuck (virtually forever) within a range of 80 to 300 blocks behind the head; a restart catches up a bit, maybe until the new pivot point. The machine is a modern MacBook with an SSD. :(

it always gets stuck on pivot - 1

Might be an off-by-one error somewhere?..

I haven't had this happen to me (yet), with geth 1.7.3 on an Ubuntu 16.04 Linux VM; then again, I'm not specifying --cache.


EDIT: Could you post the command that is used to start geth?..

@veox I don't remember the exact command at the moment (different computer), but I only specify --cache and --datadir; everything else is default.

This seems to be (extremely) related to this fix from October, https://github.com/ethereum/go-ethereum/pull/15234, and to https://github.com/ethereum/go-ethereum/pull/14849.

It was merged and reverted; I don't know if there were any fixes afterwards.

Maybe @karalabe can confirm whether it's related.

@grazz That fix didn't actually do anything... the check was redundant in the first place. See ValidateBody.

@kriffo Yeah, it looks like the only relation is the problem it was trying to solve.
I'm kind of lost with this. I can't figure out why it works for some and doesn't work for others.

--syncmode fast --cache 2048 --maxpeers 50 has failed on me 4 times in a row.

geth version
Geth
Version: 1.7.3-stable
Architecture: amd64
Protocol Versions: [63 62]
Network Id: 1
Go Version: go1.9.2
Operating System: darwin
GOPATH=/Users/vogelito/Documents/CS/go
GOROOT=/usr/local/go

I basically sync a bunch of blocks and then it gets stuck forever importing new state entries:

INFO [12-29|07:59:44] Imported new state entries               count=1140 elapsed=4.121ms   processed=48767777 pending=12838 retry=209  duplicate=2093 unexpected=12811
INFO [12-29|07:59:44] Imported new state entries               count=1013 elapsed=4.305ms   processed=48768790 pending=12127 retry=0    duplicate=2093 unexpected=12811
INFO [12-29|07:59:45] Imported new state entries               count=954  elapsed=4.494ms   processed=48769744 pending=12172 retry=213  duplicate=2093 unexpected=12811
INFO [12-29|07:59:45] Imported new state entries               count=959  elapsed=4.622ms   processed=48770703 pending=12046 retry=53   duplicate=2093 unexpected=12811
INFO [12-29|07:59:45] Imported new state entries               count=1370 elapsed=4.422ms   processed=48772073 pending=13666 retry=341  duplicate=2093 unexpected=12811
INFO [12-29|07:59:45] Imported new state entries               count=1290 elapsed=7.356ms   processed=48773363 pending=13176 retry=0    duplicate=2093 unexpected=12811
INFO [12-29|07:59:46] Imported new state entries               count=1881 elapsed=7.500ms   processed=48775244 pending=12192 retry=0    duplicate=2093 unexpected=12811
INFO [12-29|07:59:46] Imported new state entries               count=1347 elapsed=5.251ms   processed=48776591 pending=12607 retry=0    duplicate=2093 unexpected=12811
INFO [12-29|07:59:47] Imported new state entries               count=1784 elapsed=7.814ms   processed=48778375 pending=11117 retry=55   duplicate=2093 unexpected=12811
> eth.syncing
{
  currentBlock: 4813345,
  highestBlock: 4816460,
  knownStates: 50062298,
  pulledStates: 50052685,
  startingBlock: 0
}
> eth.blockNumber
0

I'm seeing knownStates and pulledStates grow unbounded. I thought the current limit was ~30M and I'm over 50M.

I will leave this overnight and report in the AM.

Actually, this just finished:

INFO [12-29|08:04:33] Imported new state entries               count=905  elapsed=10.236ms  processed=50458070 pending=0     retry=0    duplicate=2093 unexpected=12811
INFO [12-29|08:04:34] Imported new block receipts              count=1    elapsed=12.341ms  bytes=162250   number=4813346 hash=3f0256…6dbaeb ignored=0
INFO [12-29|08:04:34] Committed new head block                 number=4813346 hash=3f0256…6dbaeb
INFO [12-29|08:04:36] Imported new chain segment               blocks=5 txs=808 mgas=35.680 elapsed=2.896s    mgasps=12.320 number=4813351 hash=9c6016…caefda
WARN [12-29|08:04:36] Skipping deep transaction reorg          depth=4813351
INFO [12-29|08:04:36] Fast sync complete, auto disabling 

So currently 50458070 state entries.

@vogelito So it does work for some. I'm glad it synced. What version are you using?

@grazz on 1.7.3-stable

@vogelito Weird... for some reason Parity syncs fine.

Just wanted to let you guys know: running 1.7.3-stable, I got through 50,795,615 states and it finally finished fast sync. It took around 12 hours for the states alone.

My own fast sync attempts consistently proceed to the pivot block, then happily download states for 12+ hours (reaching about 30M processed states), then get completely stuck (the state download stalls).

After consulting the source code, my guess is this: if your sync is too slow, the state behind your pivot block gets outdated and pruned (i.e. permanently deleted) by other nodes, so you can't download it anymore and you never get to finish the sync. Currently geth's pruning history seems to be 4096 blocks; that means if you spend more than 4096 * 14 / 3600 ≈ 15 hours downloading states, it would NEVER finish (the processed states number stops growing). This number matches my experience.
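
For what it's worth, the back-of-the-envelope budget above, assuming the ~14 second block time and the 4096-block retention window mentioned in this comment (both are the commenter's figures, not verified against the source):

package main

import "fmt"

func main() {
	const (
		retainedBlocks = 4096 // assumed retention window, taken from the comment above
		blockTimeSecs  = 14.0 // rough mainnet block time in late 2017
	)
	hours := retainedBlocks * blockTimeSecs / 3600
	// Prints "~15.9 hours": the budget for finishing the state download
	// before the pivot's state risks disappearing from peers.
	fmt.Printf("~%.1f hours\n", hours)
}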

If that's true, restarting geth --fast without running geth removedb should help. This would bump the pivot block number to a more recent one, for which the required data will remain available for another ~15 hours. Some already-downloaded states would inevitably go down the drain (they become outdated by blocks mined during your previous sync attempt, i.e. some cat-herding jerk bought new cats while you were spending precious time, disk and electricity syncing up your wallet, and now you have to download their new cats again), but a large portion of them should remain valid despite the counter resetting to zero (a lot of the people covered by the 50M number are, hopefully, NOT cat-herding jerks, but you don't really know, since they could have just bought their first cat last night, so geth has to assume the number went back to 0). So after a few restarts, one should be able to finish the sync even on a slow setup.

There is no way to know how many more states you still have to download after a restart (estimating that number is about as expensive as finishing the download), but it should be smaller than the ~50M required for a first attempt. Also, during the retry attempts it's normal for disk usage to keep growing (you can't know which states have become outdated, so you have to keep everything until a pivot block is finished).

I'm making one such attempt right now. It could take days to verify this hypothesis, though...

EDIT: My guess was right; after 4 attempts of 16M, 29M, 10M, and 1M states over 2 days, my fast sync finished successfully!

@HouQiming Nice find. I wonder if I can consider this issue closed, or if it would make sense to fix this at the code level. Basically, bumping the pivot every X states.

@HouQiming wow, that's a great find! It explains a lot of reports that we've been seeing. Thank you!
Edit: Hm, it seems that's not correct though; there's no pruning.

@holiman The thing is, it's not necessarily geth pruning. During my impatient waits for long syncs, I saw a lot of peers running Parity and stuff. Sometimes I stop getting states after ~1024 blocks, so I suspect it's caused by a more aggressive alt-wallet (or a more aggressive configuration) which people were switching to due to perceived geth sync failures.

In geth 1.8.13 I see this message:
WARN [08-21|19:25:24.283] Pivot became stale, moving old=3882827 new=3882892
So maybe "bumping the pivot every X states" has been implemented?
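
For what it's worth, here is a rough standalone sketch of what that staleness check could look like. The 64-block threshold and the trigger condition are assumptions on my part, chosen so the example reproduces the old/new numbers in the warning above; they are not verified against the 1.8.13 source:

package main

import "fmt"

const fsMinFullBlocks = 64 // assumed threshold, mirroring the downloader constant

// updatePivot moves the pivot forward once the chain head has advanced far
// enough that the old pivot's state may no longer be served by peers.
func updatePivot(pivot, latestHeader uint64) uint64 {
	if latestHeader > pivot+2*fsMinFullBlocks {
		newPivot := latestHeader - fsMinFullBlocks
		fmt.Printf("Pivot became stale, moving old=%d new=%d\n", pivot, newPivot)
		return newPivot
	}
	return pivot
}

func main() {
	// Reproduces the numbers from the warning above: old=3882827 new=3882892.
	_ = updatePivot(3882827, 3882956)
}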

After that message, eth.syncing seems to stall with an unchanging currentBlock:
> eth.syncing
{
  currentBlock: 3882942,
  highestBlock: 3883026,
  knownStates: 15705856,
  pulledStates: 15692947,
  startingBlock: 913048
}
The currentBlock value is higher than the pivot, and "Imported new block headers" messages appear regularly, but there are no more "Imported new block receipts" messages.

Exiting geth with ^C and restarting immediately fetched 36 new block receipts. But then the regular "Imported new block headers" messages every 12-15 seconds increased highestBlock faster than the "Imported new block receipts" messages could catch up.

There's no indication of whether the issue is the incoming rate of block receipts from other nodes, or slow processing of them once they arrive. I suspect it's related to how the protocol defines a "new" block receipt and decides which ones to fetch from the connected peers.

The problem for me is the complexity: with different implementations (Geth, Parity, Pirl) and different versions (both stable and beta), it's possible the other nodes on the testnet are just not reliable. There are so many variables and possible interactions that it surprises me it ever synchronises (it did once in the last 2 weeks, but then Microsoft updated Windows and it all went out of sync again).

We are consistently doing fast syncs on testing infrastructure, and it consistently works. Since the original report was made (1.7.3), a lot has happened. Please open a new ticket for remaining problems, as this ticket is getting a bit unwieldy.

BTW, moving the pivot during sync was implemented a while ago. 
