Go-ethereum: Memory leak on go-ethereum 1.8.7

Created on 11 May 2018 · 40 comments · Source: ethereum/go-ethereum

System information

Geth version:

````
Version: 1.8.7-stable
Git Commit: 66432f3821badf24d526f2d9205f36c0543219de
Architecture: amd64
Protocol Versions: [63 62]
Network Id: 1
Go Version: go1.10
Operating System: linux
GOPATH=
GOROOT=/usr/local/go
````

OS & Version: Linux

Expected behaviour

Memory usage of go-ethereum should be stable.

Actual behaviour

Something happens which causes memory usage to increase substantially.

Steps to reproduce the behaviour

Unknown

pprof top

This wasn't hugely useful to us, but we're hoping it means something to the folks in here. Happy to provide additional information. This is currently happening roughly once a day.
````

go tool pprof http://localhost:6060/debug/pprof/heap

Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
Saved profile in /root/pprof/pprof.geth.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz
File: geth
Build ID: fb815f00e769fd84f7336a86cf929a66ba8ab66f
Type: inuse_space
Time: May 9, 2018 at 9:40pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 5332.62MB, 84.39% of 6318.78MB total
Dropped 349 nodes (cum <= 31.59MB)
Showing top 10 nodes out of 145
      flat  flat%   sum%        cum   cum%
 1186.93MB 18.78% 18.78%  1186.93MB 18.78%  runtime.malg
  861.97MB 13.64% 32.43%   861.97MB 13.64%  github.com/ethereum/go-ethereum/vendor/github.com/syndtr/goleveldb/leveldb/util.(*BufferPool).Get
  708.04MB 11.21% 43.63%   708.04MB 11.21%  math/big.nat.setBytes
  529.06MB  8.37% 52.00%   529.06MB  8.37%  reflect.unsafe_New
  509.52MB  8.06% 60.07%  1251.55MB 19.81%  github.com/ethereum/go-ethereum/rlp.decodeBigInt
  453.14MB  7.17% 67.24%   464.65MB  7.35%  github.com/ethereum/go-ethereum/trie.(*hasher).store
     388MB  6.14% 73.38%      388MB  6.14%  github.com/ethereum/go-ethereum/vendor/github.com/syndtr/goleveldb/leveldb/memdb.New
  264.43MB  4.18% 77.56%   264.43MB  4.18%  github.com/ethereum/go-ethereum/vendor/gopkg.in/fatih/set%2ev0.(*Set).Add
  228.51MB  3.62% 81.18%   482.26MB  7.63%  github.com/ethereum/go-ethereum/eth.(*peer).SendTransactions
  203.02MB  3.21% 84.39%   203.02MB  3.21%  github.com/ethereum/go-ethereum/event.(*Feed).Send
````
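
For anyone reproducing this: the profile above was taken against the standard Go pprof endpoints that geth serves when started with --pprof. Below is a minimal sketch (illustrative only, not geth's own wiring) of how any Go process exposes the same /debug/pprof/* handlers.

````go
// Minimal sketch (not geth's own code) of how a Go process exposes the same
// /debug/pprof/* endpoints that `geth --pprof --pprofaddr=127.0.0.1` serves,
// which is what makes `go tool pprof http://localhost:6060/debug/pprof/heap`
// possible in the first place.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Bind to localhost only, mirroring --pprofaddr=127.0.0.1 used in this report.
	log.Fatal(http.ListenAndServe("127.0.0.1:6060", nil))
}
````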

Metrics

Last month:
[screenshot: 2018-05-10 7:17:03 PM]

Last 2 months:
[screenshot: 2018-05-10 7:19:15 PM]

Additional info

For reference, this host was migrated to run:
v1.8.2 on March 5
v1.8.3 on March 27
v1.8.6 on Apr 26
v1.8.6 with PR #16576 on May 1
v1.8.7 on May 4

We run our node with the following command-line arguments:

````
--rpc --rpcapi "debug,personal,eth,web3" --targetgaslimit 1000000 --cache 2048 --gpopercentile 30 --maxpeers 25
````

On May 1st we added --metrics --pprof --pprofaddr=127.0.0.1 and started tracking various metrics. We thought this might be the source of the problem, so earlier today we restarted the node without the additional flags, but unfortunately the memory leak happened again.

This is the memory breakdown over the last 48 hours:
[screenshot: 2018-05-10 7:25:28 PM]

And since upgrading to 1.8.7:
[screenshot: 2018-05-10 7:26:39 PM]

Most helpful comment

This is the node's memory consumption since Friday morning.

[screenshot: 2018-05-12 10:59:04 PM]

1.8.3 seems to be doing better, but if the issue is indeed caused by an uptick in txs, then it's possible things only look like this because it's the weekend.

All 40 comments

Note: geth is basically the only thing running on this box, and we have made sure the memory increases on the box are all from geth. The sawtooth pattern in the memory graphs corresponds to geth restarts.

We are experiencing a similar issue:

Geth is started in a Docker container on Ubuntu with:

````
docker run \
    --memory=8g \
    -v /mnt/ethereum:/root/.ethereum \
    -p 8545:8545 \
    -p 30303:30303 \
    -p 6060:6060 \
    -d \
    ethereum/client-go:v1.8.7 \
    --pprof --pprofaddr=0.0.0.0 --rpc --rpcapi=eth,personal,txpool,debug,admin,net --rpcaddr=0.0.0.0 --cache=1024 --datadir=/root/.ethereum --gpopercentile=50 --syncmode=full --gcmode=archive --rpcvhosts=*
````

We introduced the 8 GB memory limit as a workaround so that Docker auto-restarts the container.

Output of memStats (times in UTC+2):
[screenshot: May 11, 10:23 AM]

Output of /debug/pprof/heap at 2018-05-10 17:00 UTC+2:
[image: geth_heap_20180510_1515]

Output of /debug/pprof/heap at 2018-05-11 02:00 UTC+2:
[image: geth_heap_20180511_0005]

We just downgraded to v1.8.6 with PR #16576. Will keep you posted

+1, we are also experiencing this issue.

Confirmed that 1.8.6 with PR #16576 also sees this issue. We're downgrading to 1.8.3 which is the last stable version we ran.

@ryanschneider suspects that the memory issues were always there (at least in 1.8.x) but have been exacerbated by the recent uptick in txs.

@vogelito Just wondering, are you using filter queries? (experimental feature)

@mykolasmith - no we are most definitely not.

This is the node's memory consumption since Friday morning.

[screenshot: 2018-05-12 10:59:04 PM]

1.8.3 seems to be doing better, but if the issue is indeed caused by an uptick in txs, then it's possible things only look like this because it's the weekend.

Update on memory usage. 1.8.3 still doing ok:

[screenshot]

@karalabe - I saw some memory-management related improvements on the latest v1.8.8 release
Do you think there's a chance they address the memory-related problems we're seeing here?

Unfortunately not.

@vogelito Out of curiosity, could you try reverting https://github.com/ethereum/go-ethereum/pull/16494 locally from master and seeing if that fixes it? I don't see how it would, just playing a guessing game as to what we've changed in 1.8.4 that might cause this.

@karalabe - happy to. However, I was planning on waiting a bit more to see if the problem is indeed gone in 1.8.3

We ran 1.8.6 without problems for a while until there was an uptick in network transactions.

Another interesting experiment might be to run 1.8.7 or 1.8.8 with --pprof, and when you see the memory spiking, load http://localhost:6060/memsize and generate a report (Scan "node"). This will use some experimental memory stats gathering written by @fjl to try and report how much memory different parts of our code use. That might help pinpoint what part is leaky.

Do note, however, that scanning all objects and measuring their memory use will probably hit the node hard for 30+ seconds, so take care when doing this on a production system.
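
Since the spikes reported here happen roughly once a day and sometimes overnight, it may also help to snapshot the heap profile on a timer so there is always a recent profile from just before a spike or OOM kill. A rough sketch follows; the interval, endpoint port, and file naming are assumptions, not anything geth ships.

````go
// Hedged sketch, not part of geth: periodically fetch the heap profile that
// --pprof already exposes and keep timestamped copies on disk. The 10-minute
// interval and file naming here are arbitrary examples.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

func main() {
	const url = "http://localhost:6060/debug/pprof/heap"
	for range time.Tick(10 * time.Minute) {
		resp, err := http.Get(url)
		if err != nil {
			log.Printf("fetch failed: %v", err)
			continue
		}
		name := fmt.Sprintf("heap-%s.pb.gz", time.Now().UTC().Format("20060102-150405"))
		f, err := os.Create(name)
		if err != nil {
			log.Fatal(err)
		}
		if _, err := io.Copy(f, resp.Body); err != nil {
			log.Printf("save failed: %v", err)
		}
		resp.Body.Close()
		f.Close()
		log.Printf("saved %s", name)
	}
}
````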

Happy to do whichever one you find more useful, just let me know. (and thanks for the heads up!)

I can confirm that this is still an issue in 1.8.3. We just started seeing the leak happening on 1.8.3 today.

[screenshot: 2018-05-17 12:41:15 PM]

We are restarting the node on 1.8.8 and will run with --pprof

Our node (now running 1.8.8) is not exhibiting the parabolic memory-usage behavior yet. However, memory usage is already quite high (~14 GB), and this is the report:

[screenshot: 2018-05-17 3:09:27 PM]

Edit: for some reason this message wasn't posted when I first intended it to be, so please check the timestamp on the image.

The node is now at 18.69 GB, and this is what the report shows now:
[screenshot: 2018-05-17 6:59:27 PM]

Last 24 hours running 1.8.8:

[screenshot]

Process killed by the OS:

````
# dmesg | grep oom
[14196605.693635] geth invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
[14196605.693664]  [<ffffffff81192e72>] oom_kill_process+0x202/0x3c0
[14196605.693750] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
````

We're running a few tests ourselves now too. A PR from Gary, https://github.com/ethereum/go-ethereum/pull/16720, could help with the huge number of goroutines by creating fewer of them, and it also fixes some pending-block generation issues that made tx posting slow. I can't give you meaningful numbers since the syncs only just finished, but let's see how the two behave in comparison after a day or so.

After this is in, I also have an update lined up that removes a blockage from block and transaction propagation into the network. That should avoid any goroutine leak caused by slow remote peers. Will set up a benchmark for it on Monday (want to run the above PR until then).
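
For context, the "blockage" described above is the general problem of propagation goroutines blocking on peers that cannot keep up. Below is a purely illustrative sketch of the usual mitigation pattern (a bounded per-peer queue with a non-blocking enqueue); the names peerQueue and tx are hypothetical, and this is not the actual geth change.

````go
// Illustrative sketch only: a common way to keep broadcasters from leaking
// goroutines on slow peers is a bounded per-peer queue with a non-blocking
// enqueue that drops (or disconnects) when the peer cannot keep up.
package main

import "fmt"

type tx struct{ nonce uint64 }

func main() {
	peerQueue := make(chan tx, 64) // bounded buffer per peer

	enqueued, dropped := 0, 0
	for i := uint64(0); i < 100; i++ {
		select {
		case peerQueue <- tx{nonce: i}:
			enqueued++ // the peer's writer goroutine drains this later
		default:
			dropped++ // peer too slow: drop instead of blocking the sender
		}
	}
	fmt.Printf("enqueued=%d dropped=%d\n", enqueued, dropped)
}
````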

Thanks. I will try to provide you with a memory report of the node later today. If there's anything else I can do to help you track this problem, please let me know.

@karalabe We're going to try this out as well. Will report back results in a few hours!

Current situation and memory report

[screenshot: 2018-05-18 3:16:42 PM]

[screenshot: 2018-05-18 3:20:22 PM]

Update: the node exhausted memory again.

Memory scan:
[screenshot: 2018-05-19 5:06:32 AM]

Last month:
[screenshot: 2018-05-19 5:03:59 AM]

Last week:
[screenshot: 2018-05-19 5:04:15 AM]

Last 2 days:
[screenshot: 2018-05-19 5:04:41 AM]

Last 24 hours:
[screenshot: 2018-05-19 5:05:14 AM]

Per @holiman's post in #16673, I'm attaching the output of http://localhost:6060/debug/pprof/goroutine?debug=1

This was with 1.8.8 at over 24 GB. The node had 2.3M goroutines on the event feed (99.9934% of all goroutines).

````
goroutine profile: total 2364583
2364427 @ 0x44272a 0x4427de 0x419c42 0x4198fb 0x6c1fec 0x4706e1
#   0x6c1feb    github.com/ethereum/go-ethereum/event.(*Feed).Send+0xcb /home/ubuntu/geth/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/event/feed.go:133

17 @ 0x44272a 0x4427de 0x453344 0x45305d 0x4871b8 0x4880bd 0x886b44 0x8868ce 0xa8f821 0xa8f0a6 0xa9d90b 0x985ce6 0x4706e1
#   0x45305c    sync.runtime_SemacquireMutex+0x3c                   /usr/local/go/src/runtime/sema.go:71
#   0x4871b7    sync.(*Mutex).Lock+0x107                        /usr/local/go/src/sync/mutex.go:134
#   0x4880bc    sync.(*RWMutex).Lock+0x2c                       /usr/local/go/src/sync/rwmutex.go:93
#   0x886b43    github.com/ethereum/go-ethereum/core.(*TxPool).addTxs+0x53      /home/ubuntu/geth/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/core/tx_pool.go:801
````

The full output file here
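
The dump above points at event.(*Feed).Send: Send does not return until every subscribed channel has accepted the value, so a single subscriber that stops draining its channel strands every sender. Below is a hedged, self-contained reproduction of that failure mode using the same event package; it is not geth's actual wiring.

````go
// Hedged reproduction sketch (not geth's actual wiring) of the failure mode in
// the dump above: event.Feed.Send blocks until every subscribed channel has
// accepted the value, so one subscriber that stops reading strands all senders.
package main

import (
	"fmt"
	"runtime"
	"time"

	"github.com/ethereum/go-ethereum/event"
)

func main() {
	var feed event.Feed
	ch := make(chan int) // unbuffered and never drained: a stuck subscriber
	sub := feed.Subscribe(ch)
	defer sub.Unsubscribe()

	before := runtime.NumGoroutine()
	for i := 0; i < 1000; i++ {
		go feed.Send(i) // each of these parks inside Send, just like the dump
	}
	time.Sleep(500 * time.Millisecond)
	fmt.Printf("goroutines: %d -> %d\n", before, runtime.NumGoroutine())
}
````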

Master currently contains some optimizations that should help avoid one potential bottleneck from the transaction feed. Trying that might be a good data point for us.

On top of master, I have a PR, https://github.com/ethereum/go-ethereum/pull/16769, which removes another potential congestion point from the transaction feed. Trying this out might also be a valuable data point.

No promises that these will fix the issue, but they should most definitely fix some issues that could lead to this scenario on insanely busy nodes.


Quick question: could you detail your setup a bit? How many peers do you run with? What is your API usage? What kind of requests are you serving? Anything else that might help us find congested pathways?

I see that #16769 has been merged into master. I'll try a new build off of master and report back.

Regarding our setup, here are the details:

  • startup args: --rpc --rpcapi "debug,personal,eth,web3" --targetgaslimit 1000000 --cache 2048 --gpopercentile 30 --maxpeers 25 --metrics --pprof --pprofaddr=127.0.0.1
  • We heavily use the JSON-RPC API with the following calls (a minimal polling sketch appears below):

    • eth_blockNumber: Continuously

    • eth_getBlockByNumber: Once a new block we haven't seen is mined

    • eth_getTransactionReceipt: Low volume (on transactions we care about)

    • eth_getTransactionByHash: Low volume (on transactions we care about)

    • eth_getBalance: twice / minute

    • eth_gasPrice: varies but constantly

    • personal_unlockAccount: varies but constantly

    • eth_sendTransaction: varies but constantly

    • personal_newAccount: Low volume

I hope this is useful. We're restarting our node now with a build from master and will report back.
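
Below is a minimal sketch of the continuous eth_blockNumber polling described in the list above; the endpoint (the node's default RPC port on localhost) and the one-second polling interval are assumptions for illustration.

````go
// Minimal sketch of continuous eth_blockNumber polling over JSON-RPC.
// The endpoint and the one-second interval are illustrative assumptions.
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"strconv"
	"strings"
	"time"
)

func blockNumber(url string) (uint64, error) {
	body := []byte(`{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}`)
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var reply struct {
		Result string `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&reply); err != nil {
		return 0, err
	}
	// The result is a hex quantity such as "0x5a2b4c".
	return strconv.ParseUint(strings.TrimPrefix(reply.Result, "0x"), 16, 64)
}

func main() {
	for range time.Tick(time.Second) {
		n, err := blockNumber("http://127.0.0.1:8545")
		if err != nil {
			log.Printf("eth_blockNumber failed: %v", err)
			continue
		}
		log.Printf("head block: %d", n)
	}
}
````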

For reference, this is what we're running now:

````
# /home/ubuntu/geth/current/geth version
Geth
Version: 1.8.9-unstable
Git Commit: 415969f534d8122e717e18abafdc8ec7bb913a84
Architecture: amd64
Protocol Versions: [63 62]
Network Id: 1
Go Version: go1.10
Operating System: linux
GOPATH=
GOROOT=/usr/local/go
````

I've been having this issue on both 1.8.1 and 1.8.8, but it only started appearing recently (in the last couple of weeks, I think) and has gotten to the point where I have to restart geth on a daily basis to prevent it from crashing unexpectedly. FWIW, I don't think 1.8.x should be considered 'stable' until this is addressed.

@crypt1d I don't think it's 1.8.x being unstable, rather the network usage pattern is changing and surfacing a dormant bug.

Things are looking better but perhaps still too early to tell.

Reminder:

  • We downgraded to 1.8.3 on Friday May 11th and it ran "ok(ish)" until Thursday May 17th.
  • We upgraded to 1.8.8 on Thursday May 17th and had to restart our node daily until this morning (May 21st), when we upgraded to master at commit 415969f534d8122e717e18abafdc8ec7bb913a84.

Memory graph of last week:
[screenshot: 2018-05-21 3:30:42 PM]

Daily update: Memory still looking ok here.

[screenshot: 2018-05-22 12:15:50 PM]

Memory looking OK for us as well.

Daily update: Memory still looking ok here.

Update: memory seems to have been stable as of https://github.com/ethereum/go-ethereum/pull/16769; however, we experienced several spikes in memory usage prior to 18:00 EST yesterday, after which all peers began to time out and the node never recovered.

[screenshot: 2018-05-23 2:55:20 PM]

[screenshot: 2018-05-23 2:55:14 PM]

````
May 23 14:46:04 geth[22809]: WARN [05-23|14:46:04] Synchronisation failed, dropping peer  peer=4994163696be9f52 err=timeout
May 23 15:51:21 geth[22809]: WARN [05-23|15:51:21] Synchronisation failed, dropping peer  peer=90bccbf1fccc80ae err=timeout
May 23 16:49:58 geth[22809]: WARN [05-23|16:49:58] Synchronisation failed, dropping peer  peer=6011df96954d3708 err=timeout
````

Things still looking OK for us memory-wise and no known other issues at the moment.

[screenshot]

Closing this, as the latest code fixes it.

@mykolasmith We may have just found the hang issue you experienced: https://github.com/ethereum/go-ethereum/issues/16840.

