Go-ethereum: Geth 1.8.15 - Memory Leak?

Created on 11 Sep 2018 · 15 Comments · Source: ethereum/go-ethereum

System information

Geth version:

Version: 1.8.15-stable
Git Commit: 89451f7c382ad2185987ee369f16416f89c28a7d
Architecture: amd64
Protocol Versions: [63 62]
Network Id: 1
Go Version: go1.10
Operating System: linux
GOPATH=
GOROOT=/usr/lib/go-1.10

OS & Version:

CPU model            : Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
Number of cores      : 2
CPU frequency        : 2500.000 MHz
Total size of Disk   : 49.0 GB 
Total amount of Mem  : 3891 MB 
Total amount of Swap : 0 MB 
OS                   : Ubuntu 16.04.5 LTS
Arch                 : x86_64 (64 Bit)
Kernel               : 4.4.0-1067-aws

Expected behaviour

Geth runs smoothly with normal and stable RAM usage.

Actual behaviour

It started normally at around 30% RAM usage. Memory usage then climbed slowly until Geth crashed at around 90% RAM usage.

Steps to reproduce the behaviour

Command:

$ /usr/bin/geth --nodiscover --syncmode 'fast' --cache=512 --rpc --rpcaddr=0.0.0.0 --rpcapi='db,eth,net,web3,personal,admin' --rpccorsdomain='*' --ws --wsaddr=0.0.0.0 --wsapi='db,eth,net,web3,personal,admin' --wsorigins='*' --mine --minerthreads='1'

FYI, I'm running a two-node private blockchain. Both machines have the same specs as listed above. Each node has a 50 GB EBS volume and 4 GB of RAM. They are 't3.medium' EC2 instances on AWS.

I didn't do anything to the node during the recording. No extra load was sent to the node (e.g. HTTP RPC calls or geth attach). It was just mining, syncing with the second node, and running htop in another terminal.

I also tried running the same command in the background, and the same issue occurred. I noticed that Geth stopped after ~10 minutes. My SSH session hung when RAM usage was at its peak.
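To put numbers on the growth described above instead of watching htop, one option is to sample the resident memory of the geth process from /proc. The following is a minimal sketch, assuming a Linux host; the function names `parse_vm_rss_kb` and `sample_rss_kb` are hypothetical helpers for illustration and are not part of geth.

```python
import re
import time
from pathlib import Path
from typing import List


def parse_vm_rss_kb(status_text: str) -> int:
    """Return the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("VmRSS field not found")
    return int(match.group(1))


def sample_rss_kb(pid: int, samples: int = 5, interval_s: float = 60.0) -> List[int]:
    """Read the resident set size of `pid` several times, `interval_s` apart."""
    readings = []
    for i in range(samples):
        text = Path(f"/proc/{pid}/status").read_text()
        readings.append(parse_vm_rss_kb(text))
        if i < samples - 1:
            time.sleep(interval_s)  # wait before taking the next sample
    return readings
```

Calling `sample_rss_kb(pid)` with the PID of the running geth process returns a list of readings in kB. Readings that keep rising on an otherwise idle node would be consistent with the behaviour reported here.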

Backtrace

Is this issue related to https://github.com/ethereum/go-ethereum/issues/16728 and https://github.com/ethereum/go-ethereum/issues/16859?

Can someone suggest the most stable version for me?

Source

zulhfreelancer

All 15 comments

My testing indicates that 1.8.13 is stable and 1.8.15 has some sort of a problem.
I have a test system with 1.8.15 that has grown to as much as 45 GB of memory with only a few thousand transactions. I am running 1.8.15-unstable.

pschlump on 12 Sep 2018

@pschlump thank you for the tip. A few questions for you:

What are the specs of that 1.8.13 node (CPU, RAM and disk)?
Do you have a guide or Gist for uninstalling Geth?
Are you syncing main net, test net, or a private net?

Thanks.

zulhfreelancer on 12 Sep 2018

I have a private test net. The 2 machines running Geth have 96 GB of memory, quad Xeon, and 2 x 2 TB hard drives. They are isolated from main-net by a hardware firewall. Purely test systems.

My process for downgrading the test systems: I used Docker to bring up 1.8.13 nodes, one on each system, and let them sync. Then I shut down the 1.8.15-unstable nodes. Then I brought up 2 new nodes with 1.8.13 and shut down the Docker containers.

I can confirm that the 1.8.13 version is stable and not leaking. When I bring up 1.8.15 in Docker, its memory grows until I kill it.

pschlump on 12 Sep 2018

I have not tried 1.8.14 - I will try that today in a docker container.

pschlump on 12 Sep 2018

I have run our distributed key generation application (Keep/thesis*) on 1.8.13, and geth grows by 1.1 MB of memory and then goes back down within a few minutes (good behavior). On 1.8.15 it grew by 2.3 GB! I am setting up a 1.8.14 version now.

pschlump on 12 Sep 2018

My tests indicate that 1.8.14 is OK; the problem is with 1.8.15.

pschlump on 12 Sep 2018

On a test-network? Behind a firewall? Why?

On Wed, Sep 12, 2018 at 5:03 PM, a e r t h wrote:

--rpc --rpcaddr=0.0.0.0 --rpcapi='db,eth,net,web3,personal,admin' should be illegal or something


pschlump on 13 Sep 2018

Thank you @pschlump for the pointers. I will give it a try.

zulhfreelancer on 13 Sep 2018

In those logfiles, the first shows 10 simultaneous Unlock operations in progress, and the second shows 7. The machine has 3891 MB of memory. I have run into issues unlocking a _single_ key on a USB armory, which has 500 MB. So I suspect it's the decryption that is causing the crash.

You could try using --lightkdf settings for the keystores, that will make it take a lot less memory. See https://github.com/ethereum/go-ethereum/blob/d9575e92fc6e52ba18267410fcd2426d5a148cbc/accounts/keystore/keystore_passphrase.go#L55 and https://github.com/ethereum/go-ethereum/blob/0e32989a08b8b84e7fe4ae397fe4302e93e34782/cmd/utils/flags.go#L181

I have no idea why it would differ between versions, though. But I guarantee that Unlock takes hundreds of MB of memory even on the older builds. Maybe for some reason the older builds finished the unlocks faster, so they didn't pile up and run in parallel, which is what causes the crash.
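The arithmetic behind this explanation can be sketched as follows. scrypt needs roughly 128 * N * r bytes of working memory per derivation. The values of N and r below are the ones defined in the keystore source file linked above (standard N = 1 << 18, light N = 1 << 12, r = 8); the helper names are hypothetical, and the estimate ignores everything geth allocates besides the key derivation.

```python
# Scrypt parameters as defined in go-ethereum's keystore package
# (accounts/keystore/keystore_passphrase.go at the linked commit).
SCRYPT_R = 8
STANDARD_SCRYPT_N = 1 << 18  # default keystore setting
LIGHT_SCRYPT_N = 1 << 12     # selected by --lightkdf


def scrypt_memory_bytes(n: int, r: int = SCRYPT_R) -> int:
    """Approximate working memory of one scrypt derivation: 128 * N * r bytes."""
    return 128 * n * r


def concurrent_unlock_mib(unlocks: int, n: int) -> float:
    """Approximate memory (MiB) held by `unlocks` derivations running at once."""
    return unlocks * scrypt_memory_bytes(n) / (1024 * 1024)


if __name__ == "__main__":
    for label, n in (("standard", STANDARD_SCRYPT_N), ("light", LIGHT_SCRYPT_N)):
        # Print the estimate for a single unlock and for 10 simultaneous unlocks.
        print(label, concurrent_unlock_mib(1, n), concurrent_unlock_mib(10, n))
```

Under these assumptions a single standard unlock needs about 256 MiB, so 10 simultaneous unlocks come to about 2560 MiB, a large share of the 3891 MB on the reporter's machine. With the light parameters the same 10 unlocks need about 40 MiB.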

holiman on 14 Sep 2018

Oh, and if it wasn't you calling Unlock, then some attacker is trying to brute-force passwords against your node.

holiman on 14 Sep 2018

--rpc --rpcaddr=0.0.0.0 --rpcapi='db,eth,net,web3,personal,admin' should be illegal or something

On a test-network? Behind a firewall? Why?

I guess your firewall is not properly configured, so this ticket demonstrates a pretty good reason :)
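One way to check what a node actually exposes is to query it from a machine outside the firewall. The sketch below uses geth's `rpc_modules` JSON-RPC method, which lists the API modules enabled on an endpoint. Treating `personal` and `admin` as the sensitive modules is an assumption made for this illustration, based on the flags criticised in this thread, and the helper names are hypothetical.

```python
import json
import urllib.request
from typing import Dict, List

# Assumption: modules treated as sensitive for this illustration.
SENSITIVE = ("personal", "admin")


def sensitive_modules(modules: Dict[str, str]) -> List[str]:
    """Return the sensitive module names found in an rpc_modules result."""
    return sorted(name for name in modules if name in SENSITIVE)


def fetch_rpc_modules(url: str, timeout_s: float = 5.0) -> Dict[str, str]:
    """Ask the node at `url` which API modules it serves over HTTP-RPC."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "method": "rpc_modules", "params": [], "id": 1}
    ).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.load(response)
    # An error response carries no "result" key; treat that as no modules.
    return body.get("result", {})
```

For example, `sensitive_modules(fetch_rpc_modules("http://<node-ip>:8545"))` (8545 is geth's default HTTP-RPC port) returns a non-empty list if `personal` or `admin` is reachable from wherever the script runs. If the request succeeds at all from an outside network, the firewall is not blocking the RPC port.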

holiman on 14 Sep 2018

I think I was the source of the unlocks on my system. I have looked in my logs from my firewall and I see no evidence that any unexpected outside activity took place. I am now looking into the possibility that somebody unwanted has penetrated our security and has malicious code running inside our firewall. I don't see any unexpected pending transactions and I am monitoring once a second for pending transactions. I take your comment very seriously.

pschlump on 14 Sep 2018

I managed to resolve this problem by disabling the 'db' rpc API. Not sure if it's the same root cause as you guys, but the behavior seems similar to mine.