Jormungandr: Excessive network/memory usage

Created on 29 Oct 2019 · 11 Comments · Source: input-output-hk/jormungandr

Describe the bug
jormungandr process uses up all my bandwidth and syncs very slowly.

Mandatory Information

jcli 0.7.0-rc1 (HEAD-5583e5e, release, linux [x86_64]) - [rustc 1.38.0 (625451e37 2019-09-23)]
jormungandr 0.7.0-rc1 (HEAD-5583e5e, release, linux [x86_64]) - [rustc 1.38.0 (625451e37 2019-09-23)]

To Reproduce
Use the following config:

log:
  format: "plain"
  level: "info"
  output: "stderr"
p2p:
  listen_address: "/ip4/0.0.0.0/tcp/3100"
  public_address: "/ip4/<A PUBLIC IP HERE>/tcp/3100"
  topics_of_interest:
    blocks: "high"
    messages: "high"
  trusted_peers:
    - address: "/ip4/13.230.137.72/tcp/3000" 
      id: "ed25519_pk1w6f2sclsauhfd6r9ydgvn0yvpvg4p3x3u2m2n7thknwghrfpdu5sgvrql9"
    - address: "/ip4/13.230.48.191/tcp/3000"
      id: "ed25519_pk1lzrdh0pcmhwcnqdl5cgcu7n0c76pm7g7p6pdey7wup54vz32gy6qlz5vnq"
    - address: "/ip4/18.196.168.220/tcp/3000" 
      id: "ed25519_pk1uufkgu0t9xm8ry04wnddtnku5gjg8typf5z6ehh65uc6nz4j8n4spq0xrl"
    - address: "/ip4/3.124.132.123/tcp/3000" 
      id: "ed25519_pk14tqkqnz3eydn0c8c8gmmyzxgnf2dztpy5dnrx09mhfzv0dh93s3qszqgpc"
    - address: "/ip4/18.184.181.30/tcp/3000" 
      id: "ed25519_pk178ge2jn6c40vgmrewgmg26nmtda47nk2jncukzj327mp3a9g2qzss2d44f"
    - address: "/ip4/184.169.162.15/tcp/3000" 
      id: "ed25519_pk1nk0ne8ez66w5tp2g8ctcakthjpz89eveyg0egcpylenhet83n0sq2jqz8q"
    - address: "/ip4/13.56.87.134/tcp/3000" 
      id: "ed25519_pk1ce450zrtn04eaevcn9csz0thpjuhxrysdrq6qlr9pq7e0wd842nsxy6r5k"
rest:
  listen: "127.0.0.1:3101"
storage: "./storage"

Start the node with:

RUST_BACKTRACE=full ./jormungandr --config ./node-config.yaml --genesis-block-hash ae57995b8fe086ba590c36dc930f2aa9b52b2ffa92c0698fff2347adafe8dc65

The node syncs very slowly, but network activity is very high (especially outgoing), even though the node doesn't seem to be doing anything useful. Memory usage grows from 600 MB to 4 GB in a few hours and keeps growing over time.

Expected behavior

jormungandr process should not be a resource hog.

bug subsys-network

Source

jbax

All 11 comments

Update with new version:

jcli 0.7.0-rc2 (HEAD-d6de99e, release, linux [x86_64]) - [rustc 1.38.0 (625451e37 2019-09-23)]
jormungandr 0.7.0-rc2 (HEAD-d6de99e, release, linux [x86_64]) - [rustc 1.38.0 (625451e37 2019-09-23)]

The node is now syncing properly and faster than before. Memory and network consumption is still an issue: uptime is only 3029 and memory usage is already at 7 GB. See my status:

blockRecvCnt: 2558
lastBlockDate: "32.2795"
lastBlockFees: 0
lastBlockHash: cd54f7c4d8f3195590bc0551d08b7c90acb2765dce52c36ef1be8c86db616ac5
lastBlockHeight: "13972"
lastBlockSum: 0
lastBlockTime: "2019-10-30T00:58:26+00:00"
lastBlockTx: 0
state: Running
txRecvCnt: 2
uptime: 3029

At that rate I'll have to restart the node every 6 hours to prevent my system from running out of memory.
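To put numbers on this kind of growth, the node's resident memory can be sampled at a fixed interval. The following is a minimal sketch, not something posted in the thread: the `log_rss` helper and the log file name are hypothetical, and it assumes a `ps` that supports `-o rss=` (procps on Linux does).

```shell
#!/usr/bin/env bash
# Print "<unix-timestamp> <rss-in-KiB>" for one process.
# Assumes a ps that supports `-o rss=` (procps on Linux, also BSD/macOS).
log_rss() {
  local pid="$1" rss
  rss="$(ps -o rss= -p "$pid" | tr -d ' ')"
  # ps prints nothing if the process is gone; report that as failure.
  [ -n "$rss" ] || return 1
  printf '%s %s\n' "$(date +%s)" "$rss"
}

# Hypothetical usage: sample the node once a minute until it exits.
# pgrep -x matches the exact process name; -o picks the oldest match.
# pid="$(pgrep -xo jormungandr)"
# while log_rss "$pid" >> jormungandr-rss.log; do sleep 60; done
```

Plotting the second column against the first shows whether memory grows steadily or in jumps, which helps distinguish a leak from a one-off allocation.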

Regarding network usage, it's maxing out my up/down speeds (2 mb/s down, 1 mb/s up). Do we need that much bandwidth at this stage?

jbax on 30 Oct 2019

Same problem for me. After about 14-15 hours of running, the node stopped, probably because it was using 96% of my RAM (I have 20 GB of RAM). The node also maxed out the upload bandwidth. Here are my logs from v0.7.0-rc2, including an strace capture and load statistics:

memleak.zip

bjarnekvae on 30 Oct 2019

@NicolasDP, could you please add the resource-usage label?

@bjarnekvae, what bandwidth do you have?

@jbax, for the testnet it should not be required. For Shelley mainnet, however, the current spec allows a block size of up to 2M bytes (a bit less than 2 MiB); I'm not sure whether that includes the header, which has the same size specification. If blocks are still to be produced every 20 s, the required MINIMAL upload bandwidth for a pool will be 5.7 Mb/s.
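The arithmetic behind an estimate like this can be sketched as follows. Sending one 2,000,000-byte block to a single peer within a 20-second slot needs 0.8 Mb/s; the larger figure quoted in the comment would correspond to sending roughly seven copies per slot. The peer count is an assumption for illustration only, since the thread does not state how the figure was derived, and `block_upload_bps` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Upload rate in bits per second needed to send one block of BLOCK_BYTES
# to PEERS peers within one SLOT_SECONDS interval (integer arithmetic).
block_upload_bps() {
  local block_bytes="$1" slot_seconds="$2" peers="$3"
  echo $(( block_bytes * 8 * peers / slot_seconds ))
}

block_upload_bps 2000000 20 1   # prints 800000  (0.8 Mb/s, one peer)
block_upload_bps 2000000 20 7   # prints 5600000 (5.6 Mb/s, hypothetical fan-out of 7)
```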

mark-stopka on 30 Oct 2019

@mark-stopka 12.5 MB/s down, 2.5 MB/s up

bjarnekvae on 30 Oct 2019

@bjarnekvae let's keep it in networking terms please -> 100 Mb/s down, 20 Mb/s up :). I've never had this issue on my larger VM with 32 GiB of memory; QoS on the core router is currently configured at 300 Mb/s down, 80 Mb/s up.

I've had out-of-memory issues on a 2 GiB VM. I just upgraded that to 4 GiB and am observing.

Pushing about 30 Mb/s of traffic out.

mark-stopka on 30 Oct 2019

Seems to be quite consistent for me, same thing just happened again.

bjarnekvae on 31 Oct 2019

@bjarnekvae, could you use ulimit -Sv to cap memory at, say, 4 GiB and see whether it crashes sooner with that, or at all? I would then guide you on how to get a core dump and other relevant data.
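One way to apply such a cap is sketched below, assuming bash. `ulimit -v` takes its value in KiB, so 4 GiB is 4194304. The `run_with_mem_cap` wrapper is a hypothetical helper, not something posted in the thread; it sets the limit in a subshell so the calling shell keeps its own limits.

```shell
#!/usr/bin/env bash
# Run a command with a soft virtual-memory limit given in GiB.
# The subshell keeps the limit from affecting the calling shell.
run_with_mem_cap() {
  local gib="$1"; shift
  (
    ulimit -Sv "$(( gib * 1024 * 1024 ))" || exit 1   # ulimit -v takes KiB
    "$@"
  )
}

# Hypothetical usage, reusing the start command from the report:
# run_with_mem_cap 4 ./jormungandr --config ./node-config.yaml \
#   --genesis-block-hash ae57995b8fe086ba590c36dc930f2aa9b52b2ffa92c0698fff2347adafe8dc65
```

Note that `-v` limits virtual address space rather than resident memory, so the process may hit the cap before its resident usage reaches 4 GiB.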

BTW just to confirm, you are using a storage backend and not in-memory storage, right?

mark-stopka on 31 Oct 2019

I've also been experiencing excessive resource utilization over time. Essentially, memory is slowly consumed until the server crashes. I'm using the storage parameter (storage: "./storage") in my node configuration, and have confirmed that the directory is being created and written to. Looking at my resource monitor, it seems as though jormungandr isn't flushing its chain data to disk properly, or there is a memory leak. Of course, that's a little speculative without more hard proof :) Here are some details:

Server Setup

8 GB Memory / 60 GB Disk / Ubuntu 18.04.3 (LTS) x64
jormungandr 0.7.0-rc3 (HEAD-466c0fb, release, linux [x86_64]) - [rustc 1.38.0 (625451e37 2019-09-23)]

I've attached a screenshot of my resource monitor that illustrates this.

cardano-7rc3-resources

Linicks on 1 Nov 2019

Thanks guys, we believe rc4 has, and the following releases will have, improvements on this. We already noticed bandwidth improvements recently.

NicolasDP on 3 Nov 2019

❤3

This is no longer an issue for me. I've been running rc7 for 48 hours and jormungandr is using ~120 MB of RAM and almost no CPU, and I've never seen it use more than 50 KB/s of bandwidth up or down. Good job! :)

bjarnekvae on 11 Nov 2019

Thanks. I think it's time to close this issue

NicolasDP on 11 Nov 2019
