tried both 0.4.19 and latest master:
go-ipfs version: 0.4.20-dev-fd15c62
Repo version: 7
System version: amd64/darwin
Golang version: go1.11.4
bug
I created a fresh repo this morning. It was working well for some time, but now every time I run ipfs daemon I get a huge goroutine leak that leads to an OOM within a few minutes. I set HighWater = 60, LowWater = 30 to make sure it doesn't depend on swarm size.
https://gist.github.com/requilence/8f81663a95bec7a4083e2600ff24aeda
I had the same problem a few days ago (I recreated the repo afterwards).
The list is far too large to check manually, one goroutine at a time. Maybe someone has an idea where this could come from?
I have more details to share:
I added debug logging here:
https://github.com/ipfs/go-bitswap/blob/85e3f43f0b3b6859434b16a59c36bae6abf5d29e/peermanager/peermanager.go#L131
After 2 minutes of uptime I see:
PeerManager.getOrCreate(QmRnTcjn29vbepLtQoUJdS8cYiNYUnMSrfTsTCJZUaPFRJ) times = 3, len(pm.peerQueues) = 8299, len(uniquePeersMap) = 13437
I count unique peers and per-peer call counts this way:
// Assumed declarations elsewhere in the patch (not shown in the original snippet):
//   var uniquePeersMap = make(map[peer.ID]int)
//   var uniquePeersMapMutex sync.Mutex
uniquePeersMapMutex.Lock()
times := uniquePeersMap[p] + 1
uniquePeersMap[p] = times
uniquePeersMapMutex.Unlock()
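For reference, the same instrumentation as a self-contained sketch (not go-bitswap code; the peerCounter type and its names are made up for illustration): a concurrency-safe per-peer counter that reports the same numbers as the log line above.

```go
package main

import (
	"fmt"
	"sync"
)

// peerCounter tracks how many times each peer has been seen and how many
// unique peers have been seen in total.
type peerCounter struct {
	mu   sync.Mutex
	seen map[string]int // keyed by the peer ID's string form for simplicity
}

func newPeerCounter() *peerCounter {
	return &peerCounter{seen: make(map[string]int)}
}

// Inc records one more call for peer p and returns its running total plus
// the number of unique peers observed so far.
func (c *peerCounter) Inc(p string) (times, unique int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.seen[p]++
	return c.seen[p], len(c.seen)
}

func main() {
	c := newPeerCounter()
	times, unique := c.Inc("QmRnTcjn29vbepLtQoUJdS8cYiNYUnMSrfTsTCJZUaPFRJ")
	fmt.Printf("times = %d, len(uniquePeersMap) = %d\n", times, unique)
}
```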
Please note that I have HighWater = 60 and LowWater = 30, yet despite this it has connected to 8299 peers.
@requilence could you create a dump as described here: https://github.com/ipfs/go-ipfs/blob/master/docs/debug-guide.md#beginning? We have a tool called stackparse for exactly this.
@Stebalien thanks. It was challenging to capture all of them before the OOM, as it gets worse and eats 3GB within a minute :-)
0.4.19.tar.gz
0.4.20@74d07eff35965a3f635d03aedaa43561c73679e2:
0.4.20.tar.gz
I have also added ipfs.stacks_grouped, captured with goroutine?debug=1, because the full stack dump is 64MB.
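As an aside, debug=1 collapses identical stacks into one entry with a count, which is why the grouped dump stays small even with hundreds of thousands of goroutines. A minimal sketch (not part of go-ipfs; the file name is just an example) of producing the same kind of grouped dump from inside any Go process via runtime/pprof:

```go
package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("stacks_grouped.txt") // example output file name
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// debug=1 merges identical goroutine stacks and prefixes each group
	// with a count, mirroring the HTTP endpoint's ?debug=1 output.
	if err := pprof.Lookup("goroutine").WriteTo(f, 1); err != nil {
		panic(err)
	}
}
```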
Could you post your config, minus your private keys? It looks like you're running a _relay_ which would explain all the peers.
Note: the connection manager _tries_ to keep the number of connections within the target range, but it doesn't stop new connections from being created. That's what's killing your CPU (creating/removing connections). We definitely need better back-pressure; it looks like this is a bit of a runaway process.
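To illustrate the distinction (a sketch only, with made-up names such as connTracker; this is not go-libp2p code): a trimming connection manager can overshoot HighWater between passes, whereas back-pressure refuses connections at accept time.

```go
package sketch

import (
	"net"
	"sync"
	"time"
)

// connTracker is a stand-in for the node's live connection set.
type connTracker struct {
	mu    sync.Mutex
	conns []net.Conn
}

func (t *connTracker) count() int {
	t.mu.Lock()
	defer t.mu.Unlock()
	return len(t.conns)
}

func (t *connTracker) add(c net.Conn) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.conns = append(t.conns, c)
}

// closeOldest closes and drops the n oldest connections.
func (t *connTracker) closeOldest(n int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if n > len(t.conns) {
		n = len(t.conns)
	}
	for _, c := range t.conns[:n] {
		c.Close()
	}
	t.conns = append([]net.Conn(nil), t.conns[n:]...)
}

// trimLoop models what the connection manager does today: it only reacts
// after the fact, so the count can overshoot highWater between trim passes
// while new inbound connections keep being accepted.
func trimLoop(t *connTracker, lowWater, highWater int) {
	for range time.Tick(10 * time.Second) {
		if n := t.count(); n > highWater {
			t.closeOldest(n - lowWater)
		}
	}
}

// acceptWithBackpressure models the missing piece: refuse new connections
// once the limit is hit, so the count cannot run away in the first place.
func acceptWithBackpressure(t *connTracker, c net.Conn, highWater int) {
	if t.count() >= highWater {
		c.Close()
		return
	}
	t.add(c)
}
```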
@Stebalien
You are right, I have EnableRelayHop = true and EnableAutoRelay = true
https://gist.github.com/requilence/0d713de5a8e52d666830b696a10b6264
> That's what's killing your CPU
Actually, the main problem is that it eats 3GB of RAM while the heap profile only shows about 500MB. As far as I know a goroutine is pretty cheap (about 2KB of stack), so 200k goroutines should use around 390MB. Where could the rest come from?
> You are right, I have EnableRelayHop = true
EnableAutoRelay is fine, it's EnableRelayHop that's causing everyone to use you as a relay.
> Actually, the main problem is that it eats 3GB of RAM while the heap profile only shows about 500MB. As far as I know a goroutine is pretty cheap (about 2KB of stack), so 200k goroutines should use around 390MB. Where could the rest come from?
It could be allocation velocity (https://github.com/ipfs/go-ipfs/issues/5530). Basically, we're allocating and deallocating really fast, so Go reserves a bunch of memory it thinks it might need. That's my best guess.
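A minimal sketch (not from the issue; just the standard runtime API) of how to see where that gap goes: HeapAlloc is roughly what the heap profile reports, while Sys also counts goroutine stacks and memory the runtime has reserved or not yet returned to the OS.

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	mb := func(b uint64) uint64 { return b / (1 << 20) }

	fmt.Printf("goroutines:   %d\n", runtime.NumGoroutine())
	fmt.Printf("HeapAlloc:    %d MiB (live heap objects)\n", mb(m.HeapAlloc))
	fmt.Printf("StackSys:     %d MiB (goroutine stacks)\n", mb(m.StackSys))
	fmt.Printf("HeapIdle:     %d MiB (reserved but unused heap)\n", mb(m.HeapIdle))
	fmt.Printf("HeapReleased: %d MiB (returned to the OS)\n", mb(m.HeapReleased))
	fmt.Printf("Sys:          %d MiB (total obtained from the OS)\n", mb(m.Sys))
}
```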
> EnableRelayHop that's causing everyone to use you as a relay.
It was intentional. I guess that after the EnableAutoRelay option was introduced, demand for relays increased dramatically while the supply of relays is still very thin, so this imbalance is the core reason.
Likely, yes. Basically, this is a combination of two issues:
Ideally, the connection manager and relay would actually _talk_ to each other and the relay would stop accepting new connections at some point... (https://github.com/libp2p/go-libp2p-circuit/issues/65).
@requilence has disabling relay helped?
If you want to enable relay hop, you will need to set limits in the connection manager.
Otherwise you will be quickly inundated with connections (our relays currently have 40k-50k active connections), which will lead to OOMs.
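For a standalone go-libp2p node, setting those limits looks roughly like the sketch below (illustrative only; the go-libp2p-connmgr and libp2p.New constructor signatures have changed across releases, and in go-ipfs the same limits are configured under Swarm.ConnMgr in the config file rather than in code):

```go
package main

import (
	"context"
	"time"

	libp2p "github.com/libp2p/go-libp2p"
	connmgr "github.com/libp2p/go-libp2p-connmgr"
)

func main() {
	// Keep roughly 30-60 connections: once HighWater (60) is exceeded the
	// manager trims back down to LowWater (30), sparing connections newer
	// than the grace period.
	cm := connmgr.NewConnManager(30, 60, time.Minute)

	h, err := libp2p.New(context.Background(), libp2p.ConnectionManager(cm))
	if err != nil {
		panic(err)
	}
	defer h.Close()
}
```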
See also https://github.com/libp2p/go-libp2p-circuit/pull/69
We've identified the biggest culprit in relay memory usage, and this should make it much better.
@Stebalien disabling relay doesn't help, probably because I have already advertised my peer as a relay through the DHT and it needs some time to expire.
@vyzo sounds cool, I will try this patch on the leaking setup and come back here with the results.
We have identified the identify protocol as the culprit behind the goroutine buildup. There is a series of patches that should fix the issues:
@requilence could you try the latest master?
I think I'm hitting an issue similar to this, where at some point connection counts start climbing rapidly past the default HighWater threshold, but I don't have the exact same configuration. While I have EnableAutoRelay = true, I have EnableRelayHop = false; I also have QUIC enabled.
Should I create a separate issue? Or would it be worth uploading the debug files (_e.g.,_ heap dump, stacks, config, ipfs swarm peers snapshots, _etc._) here?
@leerspace please file a new issue. Also, try disabling the DHT with --routing=dhtclient (your node may now be dialable where it wasn't before).
I'm going to close this issue as "solved" for now. If that's not the case, please yell and I'll reopen it.