Hi all,
I know there's been quite a bit of discussion on this topic already, and a lot of work has gone into improving TxPool performance in recent releases. However, we are still seeing the issue, just over a longer time frame now. Whereas previously we were running OOM after ~3 hours, our Geth node(s) running the recent 1.8.10 release show the same behaviour over roughly 15 hours.

This graph shows memory usage. The first downward spike is when we restarted the server after updating to 1.8.10; the most recent downward spike is the process falling over due to OOM.
It might be worth noting that we are running Geth as a data provider for contract data, transaction confirmations, wallet balances, etc. for a DApp communicating with the Ethereum network, so we (and our users) are constantly querying the JSON-RPC API. Not sure if that helps, but the node is definitely receiving heavy traffic on that side of things. Also worth noting that we have private geth nodes in an isolated environment that receive no such traffic and are not afflicted by the memory leak!
I've attached a dump of our geth logs. As you can see, there appears to be a ton of sleeping/dead goroutines.
geth-logs.txt
Let me know if there's any other information I can provide. Thanks for all the awesome work you guys do!
Thanks for providing good logs. I took a quick peek and it's obviously blocked goroutines. Finding the cause of the blockage will require a deeper look, but we're on it.
Also FWIW, I dropped the max peers on our nodes from 256 to 50 and memory usage has barely crept up over the last 4 hours. Wasn't sure whether the issue is related to peer communication or RPC calls, but this seems to indicate the former.

The large number of hung goroutines in promoteExecutables trying to call Feed.Send have been discussed in this PR:
https://github.com/ethereum/go-ethereum/pull/16673
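For context on why those goroutines pile up: event.Feed.Send doesn't return until every subscriber channel has accepted the value, so a single subscriber that stops draining its channel parks all senders. Here's a minimal, illustrative sketch of that behaviour (not the actual TxPool wiring):

```go
package main

import (
	"fmt"
	"time"

	"github.com/ethereum/go-ethereum/event"
)

func main() {
	var feed event.Feed

	// A subscriber that never reads from its channel after subscribing.
	ch := make(chan int) // unbuffered, nobody drains it
	sub := feed.Subscribe(ch)
	defer sub.Unsubscribe()

	done := make(chan struct{})
	go func() {
		// Send blocks until every subscriber has received the value,
		// so this goroutine stays parked, much like the ones in the dump.
		feed.Send(1)
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("send completed")
	case <-time.After(2 * time.Second):
		fmt.Println("send is still blocked after 2s")
	}
}
```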
However, there's also a fair number of goroutines like this:
goroutine 23412 [IO wait, 1023 minutes]:
internal/poll.runtime_pollWait(0x7f4f1c032058, 0x72, 0xc47ec71a68)
/usr/local/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc4976ffa18, 0x72, 0xffffffffffffff00, 0x10935c0, 0x1799ca8)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitRead(0xc4976ffa18, 0xc49e16c000, 0x1000, 0x1000)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc4976ffa00, 0xc49e16c000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
/usr/local/go/src/internal/poll/fd_unix.go:157 +0x17d
net.(*netFD).Read(0xc4976ffa00, 0xc49e16c000, 0x1000, 0x1000, 0xc47ec71b78, 0x7cde0a, 0xc460ed5268)
/usr/local/go/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc49251e520, 0xc49e16c000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
/usr/local/go/src/net/net.go:176 +0x6a
net/http.(*connReader).Read(0xc460ed5260, 0xc49e16c000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
/usr/local/go/src/net/http/server.go:764 +0xf8
bufio.(*Reader).fill(0xc48095d5c0)
/usr/local/go/src/bufio/bufio.go:100 +0x11e
bufio.(*Reader).Peek(0xc48095d5c0, 0x4, 0x0, 0x0, 0x0, 0x0, 0xc47ec71ce0)
/usr/local/go/src/bufio/bufio.go:132 +0x3a
net/http.(*conn).readRequest(0xc4aacefe00, 0x109c6e0, 0xc433117b40, 0x0, 0x0, 0x0)
/usr/local/go/src/net/http/server.go:941 +0xb5f
net/http.(*conn).serve(0xc4aacefe00, 0x109c6e0, 0xc433117b40)
/usr/local/go/src/net/http/server.go:1768 +0x4dc
created by net/http.(*Server).Serve
/usr/local/go/src/net/http/server.go:2795 +0x27b
I think the issue is that the HTTP server (presumably the RPC server in rpc/http.go) isn't setting Read|Write|IdleTimeout values: https://github.com/ethereum/go-ethereum/blob/2ad511ce0912194e9f30010f01fc338c585f8634/rpc/http.go#L168
See this page for a good description of those values (and why they are important).
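For illustration, this is roughly what setting those timeouts looks like on Go's standard http.Server; the values below are placeholders, not whatever the eventual fix uses:

```go
package main

import (
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr:    ":8545",               // geth's default HTTP-RPC port
		Handler: http.DefaultServeMux,  // stand-in for the RPC handler

		// Without these, a client that opens a connection and then never
		// sends (or reads) anything can hold a serving goroutine forever,
		// which is what the "IO wait, 1023 minutes" goroutines suggest.
		ReadTimeout:  30 * time.Second, // placeholder value
		WriteTimeout: 30 * time.Second, // placeholder value
		IdleTimeout:  2 * time.Minute,  // placeholder value
	}
	srv.ListenAndServe()
}
```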
Looks fixed as of https://github.com/ethereum/go-ethereum/pull/16880
FWIW #16880 fixes the slow connections, but not the large number of promoteExecutables goroutines in your dump. I think it's fine to keep this closed though.