Lisk-sdk: Memory leaks in the core application/postgres

Created on 16 Apr 2018 · 6 Comments · Source: LiskHQ/lisk-sdk

Expected behavior

The application/Postgres memory usage should stay consistent: garbage collection should run regularly and the heap should not keep growing, so that no memory leak occurs.

Actual behavior

The application consumes all the memory while the network (200 nodes) is syncing and forging blocks. This causes the forging nodes to crash, and the network doesn't grow.

Steps to reproduce

Run 200 nodes (lisk-core), enable forging on 10 of them and sync all the others, then post all types of transactions. At some point the heap starts to grow, which leads to memory leaks and node crashes.

Which version(s) does this affect? (Environment, OS, etc...)

1.0.0

bug

Source

ManuGowda

Most helpful comment

@MaciejBaj Here are my detailed observations of the memory leak/bloat during block processing in the local environment.

Setup:

[x] Lisk-core running in a local environment with syncing and broadcasting disabled and only forging enabled.
[x] Running the node application in inspect mode to capture heap snapshots, with the max_old_space_size parameter set to 500MB or 1000MB for different tests, and with the expose_gc parameter enabled to observe garbage collection. For example: node-inspect --trace_gc --max_old_space_size=1000 --expose_gc app.js
[x] Then running the stress test with transaction types (0, 1, 2, 3, 4, 5) using lisk-core-qa to observe the memory leak/bloat.
[x] Here is the g-drive link which contains the heap snapshots captured while heap_used kept growing extensively, consistently over 50MB.
[x] After a detailed investigation of the process memory usage and the garbage collection process, here is the conclusion I could arrive at. Before that, I would like to lay out my understanding of memory usage and garbage collection. The memory usage of the Node.js process is measured by the Resident Set Size (RSS), the amount of space occupied in the main memory device (a subset of the total allocated memory) for the process, which includes the heap, code segment, and stack. heapTotal and heapUsed refer to V8's memory usage. external refers to the memory usage of C++ objects bound to JavaScript objects managed by V8.
The heap is where objects, strings, and closures are stored. Variables are stored in the stack, and the actual JavaScript code resides in the code segment.
[x] Conclusion so far: HeapUsed and HeapTotal grow while the stress test runs. When memory spikes, the garbage collector runs a scavenge to collect the new_space, and at a later point a mark-sweep runs to collect the old_space. This makes one thing clear: there is no memory leak during forging, or even during syncing (thanks to @nazarhussain for running this test and confirming). @jondubois also confirmed in this test that socketcluster is not leaking any memory. However, the one thing that still needs to be evaluated is the Resident Set Size (RSS), which grows constantly (RSS grew 1.3GB when max_old_space_size was set to 1GB). I am investigating RSS further to conclude what caused the node to restart every time on the betanet.
[ ] TODO: Investigate the RSS memory growth
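
For anyone who wants to watch RSS next to the V8 heap figures described above, here is a minimal sketch built on process.memoryUsage(). The helper names (sampleMemory, formatSample) and the "nonHeapMB" figure are my own illustration and not part of lisk-core; external is defaulted to 0 because older Node.js versions may not report it.

```javascript
// Sketch: sample process.memoryUsage() and show how much of RSS lies
// outside the V8 heap. Helper names are hypothetical, not lisk-core code.
const MB = 1024 * 1024;

function sampleMemory(usage = process.memoryUsage()) {
  // `external` may be missing on older Node.js versions, so default it to 0.
  const { rss, heapTotal, heapUsed, external = 0 } = usage;
  return {
    rssMB: rss / MB,
    heapTotalMB: heapTotal / MB,
    heapUsedMB: heapUsed / MB,
    externalMB: external / MB,
    // Part of RSS not covered by the V8 heap: native allocations,
    // code segment, stacks, allocator overhead and so on.
    nonHeapMB: (rss - heapTotal) / MB,
  };
}

function formatSample(sample) {
  return Object.entries(sample)
    .map(([key, value]) => `${key}=${value.toFixed(1)}`)
    .join(' ');
}

// Possible usage while a stress test runs (interval chosen arbitrarily):
// const timer = setInterval(() => console.log(formatSample(sampleMemory())), 10000);
// timer.unref(); // do not keep the process alive just for logging
```

If nonHeapMB keeps climbing while heapUsedMB falls back after each mark-sweep, the growth is happening outside the JavaScript heap, which is the pattern described in the conclusion above.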

ManuGowda on 19 Apr 2018

👍3

All 6 comments

@ManuGowda While spending time on debugging, make sure that we are looking at a memory leak and not memory bloat.
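
One rough way to tell the two apart is to compare heapUsed right after forced full collections. The sketch below assumes node is started with --expose_gc so that global.gc exists; the function names and the "rises in every sample" rule are a hypothetical heuristic of mine, not an established test.

```javascript
// Sketch: compare heapUsed after forced full GCs to separate a leak from bloat.
// Assumes node was started with --expose_gc; without it no GC is forced.
function heapUsedAfterGc() {
  if (typeof global.gc === 'function') {
    global.gc(); // full collection, so mostly live objects remain
  }
  return process.memoryUsage().heapUsed;
}

// Heuristic (an assumption, not a proof): if the post-GC heap rises in every
// consecutive sample, retained memory keeps growing, which points to a leak.
// If it drops back or plateaus, the growth was more likely temporary bloat.
function classifyHeapTrend(postGcSamples) {
  if (postGcSamples.length < 3) return 'inconclusive';
  for (let i = 1; i < postGcSamples.length; i += 1) {
    if (postGcSamples[i] <= postGcSamples[i - 1]) return 'bloat-or-stable';
  }
  return 'leak-suspected';
}
```

Collecting a value from heapUsedAfterGc() every few minutes during the stress test and passing the array to classifyHeapTrend() gives a first indication; a heap snapshot comparison is still needed to confirm anything.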

nazarhussain on 16 Apr 2018

👍3

@ManuGowda Why performance label?

4miners on 16 Apr 2018

Some good references on the issue we are currently tackling:
https://github.com/nodejs/node/issues/12805
https://github.com/nodejs/node/issues/13917
https://groups.google.com/forum/#!topic/nodejs/KM0Yis-LNpg
https://github.com/nodejs/node/issues/11077

ManuGowda on 19 Apr 2018

I debugged the low-level memory usage with valgrind, using the following options:

--leak-check=full --show-leak-kinds=all --trace-children=yes

and found the following summary after syncing blocks from the network for 15 minutes:

==2901== LEAK SUMMARY:
==2901==    definitely lost: 728 bytes in 1 blocks
==2901==    indirectly lost: 704 bytes in 5 blocks
==2901==      possibly lost: 1,640 bytes in 10 blocks
==2901==    still reachable: 1,429,553 bytes in 6,664 blocks
==2901==                       of which reachable via heuristic:
==2901==                         stdstring          : 61 bytes in 1 blocks
==2901==                         newarray           : 49,880 bytes in 46 blocks
==2901==         suppressed: 0 bytes in 0 blocks

Details of the definite leak:

==2901== 1,432 (728 direct, 704 indirect) bytes in 1 blocks are definitely lost in loss record 1,146 of 1,228
==2901==    at 0x4C2E0EF: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2901==    by 0xC421CB7: void createGroup<true>(v8::FunctionCallbackInfo<v8::Value> const&) (in /home/nazar/lisk/node_modules/uws/uws_linux_48.node)
==2901==    by 0x98F7D1: v8::internal::FunctionCallbackArguments::Call(void (*)(v8::FunctionCallbackInfo<v8::Value> const&)) (in /home/nazar/.nvm/versions/node/v6.12.3/bin/node)
==2901==    by 0x9EE7FD: v8::internal::(anonymous namespace)::HandleApiCallHelper(v8::internal::Isolate*, v8::internal::(anonymous namespace)::BuiltinArguments<(v8::internal::BuiltinExtraArguments)3>) (in /home/nazar/.nvm/versions/node/v6.12.3/bin/node)
==2901==    by 0x9EF09D: v8::internal::Builtin_HandleApiCall(int, v8::internal::Object**, v8::internal::Isolate*) (in /home/nazar/.nvm/versions/node/v6.12.3/bin/nod

So there is definitely a memory leak in the uws library. Within 15 minutes it leaked 1.4 KB (1,432 bytes, direct plus indirect), so the leak can keep growing over time. I suggest switching away from that dependency.
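
To put the number in perspective, here is a small sketch that pulls the byte counts out of the summary above and extrapolates the definite leak. It assumes the leak grows at a constant rate, which a single 15-minute run does not establish; the function names are hypothetical.

```javascript
// Sketch: parse a valgrind LEAK SUMMARY and extrapolate the definite leak.
// Assumes a constant leak rate, which one 15-minute run cannot confirm.
function parseLeakSummary(text) {
  const result = {};
  const pattern = /(definitely lost|indirectly lost|possibly lost|still reachable):\s*([\d,]+) bytes/g;
  let match;
  while ((match = pattern.exec(text)) !== null) {
    result[match[1]] = Number(match[2].replace(/,/g, ''));
  }
  return result;
}

function projectedLeakBytes(leakedBytes, observedMinutes, targetMinutes) {
  // Multiply first so the sample figures divide without rounding error.
  return (leakedBytes * targetMinutes) / observedMinutes;
}

const summary = [
  '==2901==    definitely lost: 728 bytes in 1 blocks',
  '==2901==    indirectly lost: 704 bytes in 5 blocks',
  '==2901==      possibly lost: 1,640 bytes in 10 blocks',
  '==2901==    still reachable: 1,429,553 bytes in 6,664 blocks',
].join('\n');

const parsed = parseLeakSummary(summary);
const leaked = parsed['definitely lost'] + parsed['indirectly lost']; // 1432 bytes
const perDay = projectedLeakBytes(leaked, 15, 24 * 60); // 137472 bytes, about 134 KB per day
console.log(`${leaked} bytes in 15 min -> ${perDay} bytes per day`);
```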

nazarhussain on 20 Apr 2018

👍1

Closed by https://github.com/LiskHQ/lisk/pull/2018