Eos: mlock prevents usage of RAM + Swap tmpfs for chain state db

Created on 9 Jun 2018 · 5 comments · Source: EOSIO/eos

A recent commit to chainbase:
https://github.com/EOSIO/chainbase/commit/2f0bbe484bba5233fb02fac2930c1fa9cce38cd3
added a call to mlock in an attempt to lock the shared_memory file into RAM. Unfortunately, this prevents using a shared_memory file larger than physical RAM, because the call _crashes the program_ rather than failing gracefully.
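
For illustration, a graceful fallback could look roughly like the sketch below. This is hypothetical, not the actual chainbase code; the function name and log message are made up, and only the mlock(2) semantics (returning -1 and setting errno on failure) come from the man page.

```cpp
// Hypothetical sketch of a graceful mlock fallback (not the chainbase code).
#include <sys/mman.h>   // mlock
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Try to pin a mapped region in RAM; if that fails, keep running and let
// the kernel demand-page the file (including to swap) as usual.
void try_lock_region(void* addr, std::size_t len) {
    if (mlock(addr, len) != 0) {
        // ENOMEM is the typical failure when the region exceeds physical
        // RAM or RLIMIT_MEMLOCK; a warning is enough, not a fatal error.
        std::fprintf(stderr,
                     "warning: mlock failed (%s); chain state may be paged "
                     "out under memory pressure\n",
                     std::strerror(errno));
    }
}
```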

This hinders the usage of "cheap, light nodes" that, for example, use 64 GB of main memory plus 64 GB of swap on NVMe, allowing for a ~128 GB shared_memory file.

To reproduce: launch nodeos with chain-state-db-size-mb set to a value greater than physical RAM.
Expected behaviour: the shared_memory file should still work when larger than RAM, using swap space when needed.

enhancement

All 5 comments

On what OS or configuration are you running where mlock produces a fatal crash as opposed to a warning message?

Ubuntu 18.04, fresh install.
When it hits this section of the code, you can watch in htop as memory fills until it is exhausted and the process is killed.
In previous versions of chainbase, using swap worked perfectly fine: on Steem, for example, you can easily run a node with 32 GB of memory even though the shared_memory file is already >40 GB.

In addition, removing the mlock call allows the system to function with a large chain state specified, as expected.

SSD access latency is around 100 µs, roughly 1000x slower than RAM access. I don't think a node running from a swap file can keep up with other nodes in real time, given the unpredictable locations of state accesses. However, there may be some workaround in the future.

That assumes that every request goes to disk.
What you will find happening is that "hot" data stays in real memory while "cold" data gets paged out to the swap file. The operating system discovers this hot/cold split naturally (not as well as a solution that decides based on the data itself, but it works _well enough_). It only becomes a problem when too much data needs to be fed from disk.

There is a ton of research in the database literature on "hot" and "cold" data, and I encourage reading it to understand why we do not need to keep "cold" data in main memory at all times. A good starting point is f4, which contains many useful references and great ideas on datastores:
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-muralidhar.pdf
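
As a rough illustration of the paging behaviour described above, an application can also hint the kernel with madvise(2). This is a hypothetical sketch (the function and region names are invented, and MADV_COLD only exists on Linux 5.4+), not anything EOS actually does:

```cpp
// Hypothetical sketch: madvise(2) hints that help the kernel keep "hot"
// pages of a mapped database file resident and reclaim "cold" ones first.
#include <sys/mman.h>
#include <cstddef>

void hint_access_pattern(void* base, std::size_t len,
                         void* hot,  std::size_t hot_len,
                         void* cold, std::size_t cold_len) {
    // Chain-state access is scattered, so sequential read-ahead is wasted.
    madvise(base, len, MADV_RANDOM);
    // Ask the kernel to prefetch a region we expect to touch soon.
    madvise(hot, hot_len, MADV_WILLNEED);
#ifdef MADV_COLD
    // Linux 5.4+ only: mark a region we believe is cold as a preferred
    // reclaim candidate; if it turns out to be hot, it is simply faulted
    // back in from the file or swap.
    madvise(cold, cold_len, MADV_COLD);
#endif
}
```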
