tried both 0.4.19 and latest master:
go-ipfs version: 0.4.20-dev-fd15c62
Repo version: 7
System version: amd64/darwin
Golang version: go1.11.4
bug
I created a fresh repo this morning. It was working well for some time, but now every time I run ipfs daemon I get a huge goroutine leak that leads to an OOM within a few minutes. I set HighWater = 60, LowWater = 30 to make sure it doesn't depend on swarm size.
https://gist.github.com/requilence/8f81663a95bec7a4083e2600ff24aeda
I had the same problem a few days ago (I recreated the repo afterwards).
The list is far too large to check manually, one goroutine at a time. Maybe someone has an idea where this could come from?
I have more details to share:
I added debug logging here:
https://github.com/ipfs/go-bitswap/blob/85e3f43f0b3b6859434b16a59c36bae6abf5d29e/peermanager/peermanager.go#L131
After 2 minutes of uptime I see:
PeerManager.getOrCreate(QmRnTcjn29vbepLtQoUJdS8cYiNYUnMSrfTsTCJZUaPFRJ) times = 3, len(pm.peerQueues) = 8299, len(uniquePeersMap) = 13437
I count unique peers and per-peer call counts this way:
// Assumed declarations elsewhere in the patch (not shown in the original snippet):
//   var uniquePeersMap = make(map[peer.ID]int)
//   var uniquePeersMapMutex sync.Mutex
uniquePeersMapMutex.Lock()
times := uniquePeersMap[p] + 1
uniquePeersMap[p] = times
uniquePeersMapMutex.Unlock()
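For reference, the same instrumentation as a self-contained sketch (not go-bitswap code; the peerCounter type and its names are made up for illustration): a concurrency-safe per-peer counter that reports the same numbers as the log line above.

```go
package main

import (
	"fmt"
	"sync"
)

// peerCounter tracks how many times each peer has been seen and how many
// unique peers have been seen in total.
type peerCounter struct {
	mu   sync.Mutex
	seen map[string]int // keyed by the peer ID's string form for simplicity
}

func newPeerCounter() *peerCounter {
	return &peerCounter{seen: make(map[string]int)}
}

// Inc records one more call for peer p and returns its running total plus
// the number of unique peers observed so far.
func (c *peerCounter) Inc(p string) (times, unique int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.seen[p]++
	return c.seen[p], len(c.seen)
}

func main() {
	c := newPeerCounter()
	times, unique := c.Inc("QmRnTcjn29vbepLtQoUJdS8cYiNYUnMSrfTsTCJZUaPFRJ")
	fmt.Printf("times = %d, len(uniquePeersMap) = %d\n", times, unique)
}
```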
Please note that I have HighWater = 60 and LowWater = 30, yet despite this it has connected to 8299 peers.
@requilence could you create a dump as described here: https://github.com/ipfs/go-ipfs/blob/master/docs/debug-guide.md#beginning? We have a tool called stackparse for exactly this.
@Stebalien thanks. It was challenging to capture all of them before the OOM, as it gets worse and eats 3GB within a minute :-)
0.4.19.tar.gz
0.4.20@74d07eff35965a3f635d03aedaa43561c73679e2:
0.4.20.tar.gz
I have also added ipfs.stacks_grouped, captured with goroutine?debug=1, because the full stack dump is 64MB.
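As an aside, debug=1 collapses identical stacks into one entry with a count, which is why the grouped dump stays small even with hundreds of thousands of goroutines. A minimal sketch (not part of go-ipfs; the file name is just an example) of producing the same kind of grouped dump from inside any Go process via runtime/pprof:

```go
package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("stacks_grouped.txt") // example output file name
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// debug=1 merges identical goroutine stacks and prefixes each group
	// with a count, mirroring the HTTP endpoint's ?debug=1 output.
	if err := pprof.Lookup("goroutine").WriteTo(f, 1); err != nil {
		panic(err)
	}
}
```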
Could you post your config, minus your private keys? It looks like you're running a _relay_ which would explain all the peers.
Note: the connection manager _tries_ to keep the number of connections within the target range, but it doesn't stop new connections from being created. That's what's killing your CPU (creating/removing connections). We definitely need better back-pressure; it looks like this is a bit of a runaway process.
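To illustrate the distinction (a sketch only, with made-up names such as connTracker; this is not go-libp2p code): a trimming connection manager can overshoot HighWater between passes, whereas back-pressure refuses connections at accept time.

```go
package sketch

import (
	"net"
	"sync"
	"time"
)

// connTracker is a stand-in for the node's live connection set.
type connTracker struct {
	mu    sync.Mutex
	conns []net.Conn
}

func (t *connTracker) count() int {
	t.mu.Lock()
	defer t.mu.Unlock()
	return len(t.conns)
}

func (t *connTracker) add(c net.Conn) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.conns = append(t.conns, c)
}

// closeOldest closes and drops the n oldest connections.
func (t *connTracker) closeOldest(n int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if n > len(t.conns) {
		n = len(t.conns)
	}
	for _, c := range t.conns[:n] {
		c.Close()
	}
	t.conns = append([]net.Conn(nil), t.conns[n:]...)
}

// trimLoop models what the connection manager does today: it only reacts
// after the fact, so the count can overshoot highWater between trim passes
// while new inbound connections keep being accepted.
func trimLoop(t *connTracker, lowWater, highWater int) {
	for range time.Tick(10 * time.Second) {
		if n := t.count(); n > highWater {
			t.closeOldest(n - lowWater)
		}
	}
}

// acceptWithBackpressure models the missing piece: refuse new connections
// once the limit is hit, so the count cannot run away in the first place.
func acceptWithBackpressure(t *connTracker, c net.Conn, highWater int) {
	if t.count() >= highWater {
		c.Close()
		return
	}
	t.add(c)
}
```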
@Stebalien
You are right, I have EnableRelayHop = true and EnableAutoRelay = true
https://gist.github.com/requilence/0d713de5a8e52d666830b696a10b6264
> That's what's killing your CPU
Actually, the main problem is that it eats 3GB of RAM while the heap profile only shows about 500MB. As far as I know a goroutine is pretty cheap (about 2KB of stack), so 200k goroutines should use around 390MB. Where could the rest come from?
> You are right, I have EnableRelayHop = true
EnableAutoRelay is fine, it's EnableRelayHop that's causing everyone to use you as a relay.
> Actually, the main problem is that it eats 3GB of RAM while the heap profile only shows about 500MB. As far as I know a goroutine is pretty cheap (about 2KB of stack), so 200k goroutines should use around 390MB. Where could the rest come from?
It could be allocation velocity (https://github.com/ipfs/go-ipfs/issues/5530). Basically, we're allocating and deallocating really fast, so Go reserves a bunch of memory it thinks it might need. That's my best guess.
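A minimal sketch (not from the issue; just the standard runtime API) of how to see where that gap goes: HeapAlloc is roughly what the heap profile reports, while Sys also counts goroutine stacks and memory the runtime has reserved or not yet returned to the OS.

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	mb := func(b uint64) uint64 { return b / (1 << 20) }

	fmt.Printf("goroutines:   %d\n", runtime.NumGoroutine())
	fmt.Printf("HeapAlloc:    %d MiB (live heap objects)\n", mb(m.HeapAlloc))
	fmt.Printf("StackSys:     %d MiB (goroutine stacks)\n", mb(m.StackSys))
	fmt.Printf("HeapIdle:     %d MiB (reserved but unused heap)\n", mb(m.HeapIdle))
	fmt.Printf("HeapReleased: %d MiB (returned to the OS)\n", mb(m.HeapReleased))
	fmt.Printf("Sys:          %d MiB (total obtained from the OS)\n", mb(m.Sys))
}
```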
> EnableRelayHop that's causing everyone to use you as a relay.
It was intentional. I guess that after the EnableAutoRelay option was introduced, demand for relays increased dramatically while the supply of relays is still very thin, so this imbalance is the core reason.
Likely, yes. Basically, this is a combination of two issues:
Ideally, the connection manager and relay would actually _talk_ to each other and the relay would stop accepting new connections at some point... (https://github.com/libp2p/go-libp2p-circuit/issues/65).
@requilence has disabling relay helped?
If you want to enable relay hop, you will need to set limits in the connection manager.
Otherwise you will be quickly inundated with connections (our relays currently have 40k-50k active connections), which will lead to OOMs.
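For a standalone go-libp2p node, setting those limits looks roughly like the sketch below (illustrative only; the go-libp2p-connmgr and libp2p.New constructor signatures have changed across releases, and in go-ipfs the same limits are configured under Swarm.ConnMgr in the config file rather than in code):

```go
package main

import (
	"context"
	"time"

	libp2p "github.com/libp2p/go-libp2p"
	connmgr "github.com/libp2p/go-libp2p-connmgr"
)

func main() {
	// Keep roughly 30-60 connections: once HighWater (60) is exceeded the
	// manager trims back down to LowWater (30), sparing connections newer
	// than the grace period.
	cm := connmgr.NewConnManager(30, 60, time.Minute)

	h, err := libp2p.New(context.Background(), libp2p.ConnectionManager(cm))
	if err != nil {
		panic(err)
	}
	defer h.Close()
}
```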
See also https://github.com/libp2p/go-libp2p-circuit/pull/69
We've identified the biggest culprit in relay memory usage, and this should make it much better.
@Stebalien disabling relay doesn't help, probably because I have already advertised my peer as a relay through the DHT and it needs some time to expire.
@vyzo sounds cool, I will try this patch on the leaking setup and come back here with the results.
We have identified the identify protocol as the culprit behind the goroutine buildup. There is a series of patches that should fix the issues:
@requilence could you try the latest master?
I think I'm hitting an issue similar to this, where at some point connection counts start climbing rapidly past the default HighWater threshold, but I don't have the exact same configuration. While I have EnableAutoRelay = true, I have EnableRelayHop = false; I also have QUIC enabled.
Should I create a separate issue? Or would it be worth uploading the debug files (_e.g.,_ heap dump, stacks, config, ipfs swarm peers snapshots, _etc._) here?
@leerspace please file a new issue. Also, try disabling the DHT with --routing=dhtclient (your node may now be dialable where it wasn't before).
I'm going to close this issue as "solved" for now. If that's not the case, please yell and I'll reopen it.