Go-ethereum: Memory Leak Still Present in TxPool as of 1.8.10

Created on 1 Jun 2018 · 5 comments · Source: ethereum/go-ethereum

Hi all,

I know there's been quite a bit of discussion on this topic already, and a lot of work has gone into improving TxPool performance in recent releases. However, we are still seeing the issue, albeit over a longer time frame now. Whereas previously we were running out of memory after ~3 hours, our Geth nodes running the recent 1.8.10 release show the same behavior over a somewhat longer period (~15 hours).

[Screenshot: memory usage graph, 2018-06-01 9:58 AM]

This graph shows memory usage. The first downward spike is when we restarted the server after updating to 1.8.10. The most recent downward spike is the process falling over due to OOM.

It might be worth noting that we run Geth as a data provider for contract data, transaction confirmations, wallet balances, etc. for a DApp communicating with the Ethereum network. This means we (and our users) are constantly querying the JSON-RPC API, so the node definitely receives heavy traffic on that side of things. Also worth noting: we have private geth nodes in an isolated environment that receive no such traffic, and they are not affected by the memory leak!
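
For a rough sense of the kind of traffic involved, the sketch below approximates what our backend does continuously. It is purely illustrative (the endpoint URL and address are placeholders, not our actual code), using go-ethereum's ethclient package over the HTTP JSON-RPC endpoint:

    package main

    import (
        "context"
        "log"
        "time"

        "github.com/ethereum/go-ethereum/common"
        "github.com/ethereum/go-ethereum/ethclient"
    )

    func main() {
        // Connect to the node's HTTP JSON-RPC endpoint (placeholder URL).
        client, err := ethclient.Dial("http://localhost:8545")
        if err != nil {
            log.Fatal(err)
        }

        // Placeholder address; in practice the set of queried addresses is large.
        addr := common.HexToAddress("0x0000000000000000000000000000000000000000")

        // Poll a wallet balance every few seconds, the way a DApp backend might.
        for {
            ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
            balance, err := client.BalanceAt(ctx, addr, nil) // nil = latest block
            cancel()
            if err != nil {
                log.Println("balance query failed:", err)
            } else {
                log.Println("balance (wei):", balance)
            }
            time.Sleep(3 * time.Second)
        }
    }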

I've attached a dump of our geth logs. As you can see, there appear to be a large number of sleeping/dead goroutines.
geth-logs.txt

Let me know if there's any other information I can provide. Thanks for all the awesome work you guys do!

All 5 comments

Thanks for providing good logs. I took a quick peek and it's clearly a case of blocked goroutines. Finding the cause of the blockage will require a deeper look, but we're on it.

Also FWIW, I dropped the max peers on our nodes from 256 to 50 and memory usage has barely crept up over the last 4 hours. Wasn't sure whether the issue is related to peer communication or RPC calls, but this seems to indicate the former.

[Screenshot: memory usage graph after lowering max peers, 2018-06-01 6:23 PM]

The large number of hung goroutines in promoteExecutables trying to call Feed.Send has been discussed in this PR:

https://github.com/ethereum/go-ethereum/pull/16673
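
For context, here is a minimal standalone sketch (not the actual txpool wiring) showing why a Feed.Send call hangs when a subscriber stops draining its channel, which is the symptom those promoteExecutables traces show:

    package main

    import (
        "fmt"
        "time"

        "github.com/ethereum/go-ethereum/event"
    )

    func main() {
        var feed event.Feed // the zero value is ready to use

        // A subscriber that never reads from its (unbuffered) channel.
        ch := make(chan int)
        sub := feed.Subscribe(ch)
        defer sub.Unsubscribe()

        done := make(chan struct{})
        go func() {
            // Feed.Send blocks until every subscriber has received the value,
            // so this goroutine hangs for as long as ch has no receiver.
            feed.Send(1)
            close(done)
        }()

        select {
        case <-done:
            fmt.Println("send completed")
        case <-time.After(time.Second):
            fmt.Println("send still blocked after 1s (stuck sender goroutine)")
        }
    }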

However, there's also a fair number of goroutines like this:

 goroutine 23412 [IO wait, 1023 minutes]:
 internal/poll.runtime_pollWait(0x7f4f1c032058, 0x72, 0xc47ec71a68)
         /usr/local/go/src/runtime/netpoll.go:173 +0x57
 internal/poll.(*pollDesc).wait(0xc4976ffa18, 0x72, 0xffffffffffffff00, 0x10935c0, 0x1799ca8)
         /usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9b
 internal/poll.(*pollDesc).waitRead(0xc4976ffa18, 0xc49e16c000, 0x1000, 0x1000)
         /usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
 internal/poll.(*FD).Read(0xc4976ffa00, 0xc49e16c000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
         /usr/local/go/src/internal/poll/fd_unix.go:157 +0x17d
 net.(*netFD).Read(0xc4976ffa00, 0xc49e16c000, 0x1000, 0x1000, 0xc47ec71b78, 0x7cde0a, 0xc460ed5268)
         /usr/local/go/src/net/fd_unix.go:202 +0x4f
 net.(*conn).Read(0xc49251e520, 0xc49e16c000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
         /usr/local/go/src/net/net.go:176 +0x6a
 net/http.(*connReader).Read(0xc460ed5260, 0xc49e16c000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
         /usr/local/go/src/net/http/server.go:764 +0xf8
 bufio.(*Reader).fill(0xc48095d5c0)
         /usr/local/go/src/bufio/bufio.go:100 +0x11e
 bufio.(*Reader).Peek(0xc48095d5c0, 0x4, 0x0, 0x0, 0x0, 0x0, 0xc47ec71ce0)
         /usr/local/go/src/bufio/bufio.go:132 +0x3a
 net/http.(*conn).readRequest(0xc4aacefe00, 0x109c6e0, 0xc433117b40, 0x0, 0x0, 0x0)
         /usr/local/go/src/net/http/server.go:941 +0xb5f
 net/http.(*conn).serve(0xc4aacefe00, 0x109c6e0, 0xc433117b40)
         /usr/local/go/src/net/http/server.go:1768 +0x4dc
 created by net/http.(*Server).Serve
         /usr/local/go/src/net/http/server.go:2795 +0x27b

I think the issue is that the HTTP server (presumably the RPC server in rpc/http.go) isn't setting the ReadTimeout/WriteTimeout/IdleTimeout values:

https://github.com/ethereum/go-ethereum/blob/2ad511ce0912194e9f30010f01fc338c585f8634/rpc/http.go#L168

See this page for a good description of those values (and why they are important).
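
For illustration, here is a minimal sketch of what setting those timeouts looks like on a plain net/http server. The values are arbitrary and the wiring is not the actual change made to rpc/http.go:

    package main

    import (
        "io"
        "log"
        "net/http"
        "time"
    )

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            io.WriteString(w, "ok\n") // placeholder handler standing in for the RPC handler
        })

        srv := &http.Server{
            Addr:    ":8545",
            Handler: mux,
            // Without these, a client that opens a connection and never sends
            // (or never finishes) a request keeps its serve goroutine parked
            // in readRequest indefinitely, as in the trace above.
            ReadTimeout:  30 * time.Second, // illustrative values
            WriteTimeout: 30 * time.Second,
            IdleTimeout:  120 * time.Second,
        }
        log.Fatal(srv.ListenAndServe())
    }

With an IdleTimeout in place, idle keep-alive connections from slow or vanished clients get closed instead of accumulating goroutines.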

FWIW #16880 fixes the slow connections, but not the large number of promoteExecutables goroutines in your dump. I think it's fine to keep this closed though.
