Nano-node: Change storage technology

Created on 25 Jan 2018 · 26 Comments · Source: nanocurrency/nano-node

Why

Currently one of the largest costs of running a Raiblocks node is due to the large amount of IO needed just to keep up with the current write rate.

General Info

Spinning disks can do ~75 to 100 IOPS, or about 120 megabytes per second of sequential IO.
Consumer SSDs can do ~10K IOPS, or about 375 megabytes per second of IO.

Problem Description

Bootstrapping currently requires ~1k IOPS and 3 megabytes/s. So LMDB is generating a lot of very small writes for every block, but it's not actually writing much data. A single spinning disk could easily sustain that write rate if the IOs were structured differently.
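To make the mismatch concrete, here is a small back-of-the-envelope calculation using only the figures quoted above. The variable names are illustrative, and the numbers are the rough estimates from this issue rather than measurements of any particular drive.

```python
# Figures quoted in the issue; the variable names are illustrative only.
BOOTSTRAP_WRITE_IOPS = 1_000      # ~1k write operations per second
BOOTSTRAP_MB_PER_S = 3            # ~3 megabytes per second
HDD_MAX_IOPS = 100                # upper end of the ~75 to 100 range
HDD_SEQUENTIAL_MB_PER_S = 120     # sequential throughput of a spinning disk

# Average size of a single write issued during bootstrap.
avg_write_kb = BOOTSTRAP_MB_PER_S * 1000 / BOOTSTRAP_WRITE_IOPS

# How far the operation count exceeds what one spinning disk can do.
iops_vs_hdd = BOOTSTRAP_WRITE_IOPS / HDD_MAX_IOPS

# How little of the disk's sequential bandwidth the same data would need.
bandwidth_share = BOOTSTRAP_MB_PER_S / HDD_SEQUENTIAL_MB_PER_S

print(f"average write size: {avg_write_kb:.1f} kB")
print(f"IOPS needed vs. one HDD: {iops_vs_hdd:.0f}x")
print(f"share of HDD sequential bandwidth: {bandwidth_share:.1%}")
```

In other words, the workload asks for roughly ten times the operations a spinning disk can deliver while using only a few percent of its sequential bandwidth.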

LMDB is a memory-mapped B-tree. It makes for very fast random reads; however, writes are expensive.

That's not ideal for this use case, where we are more concerned with being able to sustain a large write rate. Access is also heavily skewed by age: newer data is more likely to be read, while old data is less likely to be read. So we should choose a data-storage technology that allows for very cheap writes, has relatively cheap reads on recent data, and can scale to large amounts of data.

Log Structured Merge Trees however have the exact properties that we're looking for. See: The advantages of an LSM vs a B-Tree

Log structured merge trees allow writes to come in at a fantastic rate, and only generate a small number of larger IOs. So we should think about replacing LMDB with a log structured merge tree. The best of breed currently is RocksDB. It also has the added advantage that it can compress blocks.
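As a rough illustration of why an LSM turns many small writes into a few large ones, here is a toy in-memory sketch. It is not how RocksDB is implemented: the class name `TinyLSM` and the `memtable_limit` parameter are made up for this example, and it leaves out the write-ahead log, compaction, and compression. Writes are buffered in a memtable and flushed as one sorted run; reads check the memtable first and then the runs from newest to oldest, which is why recent data stays cheap to read.

```python
import bisect
import hashlib


class TinyLSM:
    """Toy LSM: an in-memory buffer plus a list of immutable sorted runs."""

    def __init__(self, memtable_limit=1000):
        self.memtable_limit = memtable_limit
        self.memtable = {}      # recent writes, unsorted
        self.runs = []          # flushed sorted runs, newest last
        self.flushes = 0        # each flush models one large sequential write

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Sort once and write the whole buffer out as a single run.
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.flushes += 1

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):           # newest data first
            i = bisect.bisect_left(run, (key,))   # binary search within the run
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None


def block_key(i):
    # High-entropy key, standing in for a block hash.
    return hashlib.sha256(str(i).encode()).digest()


db = TinyLSM(memtable_limit=1000)
for i in range(10_000):
    db.put(block_key(i), i)

print(f"{db.flushes} large writes for 10000 inserts")
```

The point of the sketch is only the shape of the IO: the number of flushes depends on the buffer size, not on how scattered the keys are.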

Suggested Solution

Add RocksDB
Add ZSTD
Configure RocksDB with universal compaction
Add a flag to allow using RocksDB.
After it's all tested and shown to be working remove the LMDB code.

Steps to reproduce the issue:

Start a new rai node
Run iostat -dxt 5
Notice the very very small IO's being issued.

Environment:

cpu:
                       Intel(R) Core(TM) i5-4250U CPU @ 1.30GHz, 1995 MHz
                       Intel(R) Core(TM) i5-4250U CPU @ 1.30GHz, 1950 MHz
                       Intel(R) Core(TM) i5-4250U CPU @ 1.30GHz, 1733 MHz
                       Intel(R) Core(TM) i5-4250U CPU @ 1.30GHz, 2444 MHz
storage:
                       Intel 8 Series SATA Controller 1 [AHCI mode]
network:
  eno1                 Intel Ethernet Connection I218-V
network interface:
  eno1                 Ethernet network interface
  lo                   Loopback network interface
  docker0              Ethernet network interface
  veth8964baa          Ethernet network interface
disk:
  /dev/sda             Crucial_CT120M50
partition:
  /dev/sda1            Partition
  /dev/sda2            Partition
  /dev/sda3            Partition

logs

01/24/2018 07:39:03 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.40    0.00  777.60     0.00  3742.40     9.63     3.77    4.85    0.00    4.85   0.04   3.04

01/24/2018 07:39:08 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.40    0.00  868.20     0.00  4256.80     9.81     4.18    4.82    0.00    4.82   0.04   3.44

01/24/2018 07:39:13 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.40    0.00  696.60     0.00  3480.00     9.99     3.49    5.00    0.00    5.00   0.04   2.88

01/24/2018 07:39:18 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.40    0.00  586.40     0.00  2907.20     9.92     3.03    5.17    0.00    5.17   0.04   2.48

01/24/2018 07:39:23 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.60    0.00  490.60     0.00  2472.00    10.08     2.54    5.18    0.00    5.18   0.04   2.08
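The five samples above can be summarized with a few lines of Python. The rows are copied from the log with whitespace collapsed; the only derived quantity is kilobytes per write (wkB/s divided by w/s).

```python
IOSTAT_HEADER = ("Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz "
                 "avgqu-sz await r_await w_await svctm %util")
IOSTAT_ROWS = """\
sda 0.00 0.40 0.00 777.60 0.00 3742.40 9.63 3.77 4.85 0.00 4.85 0.04 3.04
sda 0.00 0.40 0.00 868.20 0.00 4256.80 9.81 4.18 4.82 0.00 4.82 0.04 3.44
sda 0.00 0.40 0.00 696.60 0.00 3480.00 9.99 3.49 5.00 0.00 5.00 0.04 2.88
sda 0.00 0.40 0.00 586.40 0.00 2907.20 9.92 3.03 5.17 0.00 5.17 0.04 2.48
sda 0.00 0.60 0.00 490.60 0.00 2472.00 10.08 2.54 5.18 0.00 5.18 0.04 2.08
"""

# Map each numeric column name to its value, skipping the device name.
columns = IOSTAT_HEADER.split()[1:]
samples = [dict(zip(columns, map(float, row.split()[1:])))
           for row in IOSTAT_ROWS.splitlines()]

# Average size of one write request in each 5-second window.
kb_per_write = [s["wkB/s"] / s["w/s"] for s in samples]
mean_writes_per_s = sum(s["w/s"] for s in samples) / len(samples)

for size in kb_per_write:
    print(f"{size:.2f} kB per write")
print(f"mean writes per second: {mean_writes_per_s:.2f}")
```

Every window works out to roughly 5 kB per write, which agrees with the avgrq-sz column (reported in 512-byte sectors, so about 10 sectors per request).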

Source

elliottneilclark

All 26 comments

LSMs are indeed great for writes. However, I think LMDB can work well with better batching. Stuff like https://github.com/clemahieu/raiblocks/pull/222, and we could look into environment sharding, as well as a no-sync option like some other cryptos using LMDB support.

As far as reliability goes, lmdb's author has a (biased) opinion: https://www.reddit.com/r/Monero/comments/4rdnrg/lmdb_vs_rocksdb/d51egcs/?st=jcucg6q0&sh=f4667a64

I personally think any future lmdb swapout should consider libmdbx (which is basically lmdb++)

cryptocode on 25 Jan 2018

👍1

I am very skeptical that LMDB will ever scale well with the current workload. Anything B-Tree will struggle massively with the incoming write rate that's needed to keep up with #493 . The data being written is all keyed off of keys with lots and lots of entropy. So every time a new leaf is added to the tree the mutations will be spread out through the whole tree. Batching will not really help this at all. After the tree is large enough (meaning after bootstrapping), there will be almost no chance that two edits need to change the same parts of the tree. Hence batching will not gain much; it might reduce the two inserts into one, but the many changes to the btree will not be made faster at all.
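The batching argument can be checked with a simple model. The sketch below assumes keys are uniformly random (a reasonable stand-in for hashes) and that each key lands on one of `n_pages` equally likely leaf pages; the page count and batch size are hypothetical, not measured from a real ledger. It compares the closed-form expectation of distinct pages touched with a quick simulation.

```python
import random


def expected_pages_touched(n_pages, batch):
    # Expected number of distinct pages hit by `batch` uniform random keys.
    return n_pages * (1 - (1 - 1 / n_pages) ** batch)


def simulated_pages_touched(n_pages, batch, rng):
    # Draw one leaf page per key and count the distinct pages.
    return len({rng.randrange(n_pages) for _ in range(batch)})


N_PAGES = 1_000_000   # hypothetical number of leaf pages after bootstrapping
BATCH = 1_000         # hypothetical number of inserts grouped into one batch

expected = expected_pages_touched(N_PAGES, BATCH)
simulated = simulated_pages_touched(N_PAGES, BATCH, random.Random(0))

print(f"expected distinct leaf pages: {expected:.1f}")
print(f"simulated distinct leaf pages: {simulated}")
```

Under these assumptions a batch of 1,000 inserts still dirties almost 1,000 different leaf pages, so batching saves commits but hardly any page writes.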

As for the reliability claims, I'll just say I strongly disagree with the lmdb author, and leave it at that.

If you feel strongly that you want to go with LMDB (or something derived from it), I'll be very interested in seeing what works and what doesn't.

elliottneilclark on 25 Jan 2018

If you feel strongly that you want to go with LMDB

@elliottneilclark For the record, I'm not in a position to decide anything :) I'm just another contributor chiming in. I'm fairly confident that none of the available KV stores is a panacea. LMDB is rock solid and extremely portable though, and that matters.

Hence batching will not gain much;

I think benchmarking will have to decide that - the db is clearly not used optimally at the moment. And then there's the topic of sharding the environment. At the very least splitting the wallets and the ledger.

Trying RocksDB would be interesting IMO, but it would have to be a lot more robust than LevelDB ever was.

Considering the backing it has, I'm sure it is though ;)

Perhaps the ideal is to introduce a KV store layer so backends can be plugged in easily and benchmarked properly. Users/apps/exchanges could then pick whatever suits them best. A lot of work though.
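A pluggable layer along those lines could look roughly like the sketch below. It is written in Python for brevity (the node itself is C++), and every name in it (`KVStore`, `MemoryStore`, `count_entries`) is hypothetical rather than part of the node's real store API. Transactions are deliberately left out, which is exactly the hard part in practice.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, Optional, Tuple


class KVStore(ABC):
    """Hypothetical backend-neutral interface; not the node's real store API."""

    @abstractmethod
    def get(self, table: str, key: bytes) -> Optional[bytes]: ...

    @abstractmethod
    def put(self, table: str, key: bytes, value: bytes) -> None: ...

    @abstractmethod
    def delete(self, table: str, key: bytes) -> None: ...

    @abstractmethod
    def scan(self, table: str) -> Iterator[Tuple[bytes, bytes]]:
        """Yield (key, value) pairs in ascending key order."""


class MemoryStore(KVStore):
    """Trivial in-memory backend, useful as a baseline in benchmarks."""

    def __init__(self) -> None:
        self._tables: Dict[str, Dict[bytes, bytes]] = {}

    def get(self, table, key):
        return self._tables.get(table, {}).get(key)

    def put(self, table, key, value):
        self._tables.setdefault(table, {})[key] = value

    def delete(self, table, key):
        self._tables.get(table, {}).pop(key, None)

    def scan(self, table):
        yield from sorted(self._tables.get(table, {}).items())


def count_entries(store: KVStore, table: str) -> int:
    # Code written against KVStore works with any backend.
    return sum(1 for _ in store.scan(table))


store = MemoryStore()
store.put("frontiers", b"\x02", b"second")
store.put("frontiers", b"\x01", b"first")
print(count_entries(store, "frontiers"), list(store.scan("frontiers")))
```

An LMDB or RocksDB backend would implement the same four methods, so a benchmark harness could swap them without touching the calling code.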

cryptocode on 25 Jan 2018

👍1

Trying RocksDB would be interesting IMO, but it would have to be a lot more robust than LevelDB ever was.

RocksDB is very solid.

Perhaps the ideal is to introduce a KV store layer so backends can be plugged in easily and benchmarked properly

That was my initial thought, but then I started looking at it and lmdb transactions have pervaded the apis. So it's probably more work than I can take on to make it fully pluggable. However I am hoping to try a quick proof of concept for rocksdb.

elliottneilclark on 26 Jan 2018

👍2

@Elliott Just do it, let it happen. Obviously raiblocks needs you guys to make it more perfect.

learnforpractice on 26 Jan 2018

https://github.com/elliottneilclark/raiblocks/tree/rocksdb

My CMake is pretty rusty, but it compiles. So we're on the way.

elliottneilclark on 26 Jan 2018

@elliottneilclark Compiling is a great start ;)

Not sure about choosing zstd over lz4 though. While zstd yields better ratios, it is 3x slower at decompression and almost twice as slow at compression (actually LZ4 fast 8 pulls off 900 MB/s)

A pruned node is probably not going to have much redundancy btw, so maybe make it optional.
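The redundancy point can be illustrated without either library. The sketch below uses zlib from the Python standard library purely as a stand-in, since neither ZSTD nor LZ4 ships with Python; the absolute ratios say nothing about those two codecs. It only shows that hash-like (random) bytes barely compress at all, while repetitive data compresses very well. Real ledger data sits somewhere in between, so this is an assumption-laden illustration rather than a prediction.

```python
import os
import zlib

SIZE = 1 << 20  # 1 MiB of sample data for each case

# Random bytes stand in for hashes and signatures (an assumption).
hash_like = os.urandom(SIZE)

# One 32-byte record repeated many times stands in for redundant data.
record = b"\x00" * 24 + os.urandom(8)
redundant = record * (SIZE // len(record))


def compressed_ratio(data: bytes) -> float:
    # Compressed size divided by original size; lower is better.
    return len(zlib.compress(data, 6)) / len(data)


ratio_hash_like = compressed_ratio(hash_like)
ratio_redundant = compressed_ratio(redundant)

print(f"hash-like data: {ratio_hash_like:.3f} of original size")
print(f"redundant data: {ratio_redundant:.3f} of original size")
```

If most of a pruned ledger looks like the first case, compression costs CPU for little gain, which is the argument for making it optional.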

cryptocode on 26 Jan 2018

👍1

Yeah ZSTD is slower; choosing between LZ4 and ZSTD will be one of the things we'll have to test later. I'm going with my usual default and then we can see if we need more compression/decompression speed.

elliottneilclark on 26 Jan 2018

Sounds great

cryptocode on 26 Jan 2018

The current tip of the rocksdb branch in my github has double writing. I'm currently resyncing and seeing how that goes.

elliottneilclark on 27 Jan 2018

I ran a node last night that was double writing to RocksDB and to lmdb. This morning after everything was synced up this is what I saw for size:

65M     /home/eclark/RaiBlocks/accounts
18M     /home/eclark/RaiBlocks/blocks_info
18M     /home/eclark/RaiBlocks/change_blocks
22M     /home/eclark/RaiBlocks/checksum
4.0K    /home/eclark/RaiBlocks/config.json
3.0G    /home/eclark/RaiBlocks/data.ldb
4.0K    /home/eclark/RaiBlocks/data.ldb-lock
64M     /home/eclark/RaiBlocks/frontiers
5.9M    /home/eclark/RaiBlocks/log
18M     /home/eclark/RaiBlocks/meta
69M     /home/eclark/RaiBlocks/open_blocks
76M     /home/eclark/RaiBlocks/pending
325M    /home/eclark/RaiBlocks/receive_blocks
40M     /home/eclark/RaiBlocks/representation
397M    /home/eclark/RaiBlocks/send_blocks
129M    /home/eclark/RaiBlocks/unchecked
152K    /home/eclark/RaiBlocks/unsynced
40M     /home/eclark/RaiBlocks/vote
4.2G    total

LMDB has about 3 GB of space used, while RocksDB has about 1.2 GB. That's a 60% space saving.
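The saving can be recomputed from the listing above. This sketch assumes that data.ldb is the LMDB file and that every other entry except config.json, data.ldb-lock, and log belongs to RocksDB; that split is inferred from the totals, not stated explicitly. Directory names are shortened to their last path component.

```python
DU_LISTING = """\
65M accounts
18M blocks_info
18M change_blocks
22M checksum
4.0K config.json
3.0G data.ldb
4.0K data.ldb-lock
64M frontiers
5.9M log
18M meta
69M open_blocks
76M pending
325M receive_blocks
40M representation
397M send_blocks
129M unchecked
152K unsynced
40M vote
"""

UNIT_TO_MB = {"K": 1 / 1024, "M": 1.0, "G": 1024.0}


def to_mb(size: str) -> float:
    # Convert a du-style size such as "3.0G" into megabytes.
    return float(size[:-1]) * UNIT_TO_MB[size[-1]]


sizes = {name: to_mb(size)
         for size, name in (line.split() for line in DU_LISTING.splitlines())}

# Assumption: these entries are not part of the RocksDB data.
NOT_ROCKSDB = {"config.json", "data.ldb", "data.ldb-lock", "log"}

lmdb_mb = sizes["data.ldb"]
rocksdb_mb = sum(mb for name, mb in sizes.items() if name not in NOT_ROCKSDB)
savings = 1 - rocksdb_mb / lmdb_mb

print(f"LMDB: {lmdb_mb:.0f} MB, RocksDB: {rocksdb_mb:.0f} MB, saving: {savings:.0%}")
```

With du's rounding this lands a little under 60%, consistent with the estimate above.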

elliottneilclark on 27 Jan 2018

@elliottneilclark that seems about right, --vacuum tends to shave off 60% as well

cryptocode on 27 Jan 2018

The size is the same before and after --vacuum. So the 60% savings is more likely due to compression

65M     /home/eclark/RaiBlocks/accounts
18M     /home/eclark/RaiBlocks/blocks_info
18M     /home/eclark/RaiBlocks/change_blocks
22M     /home/eclark/RaiBlocks/checksum
4.0K    /home/eclark/RaiBlocks/config.json
3.0G    /home/eclark/RaiBlocks/data.ldb
4.0K    /home/eclark/RaiBlocks/data.ldb-lock
64M     /home/eclark/RaiBlocks/frontiers
5.9M    /home/eclark/RaiBlocks/log
18M     /home/eclark/RaiBlocks/meta
69M     /home/eclark/RaiBlocks/open_blocks
76M     /home/eclark/RaiBlocks/pending
325M    /home/eclark/RaiBlocks/receive_blocks
40M     /home/eclark/RaiBlocks/representation
397M    /home/eclark/RaiBlocks/send_blocks
129M    /home/eclark/RaiBlocks/unchecked
152K    /home/eclark/RaiBlocks/unsynced
40M     /home/eclark/RaiBlocks/vote
4.2G    total

elliottneilclark on 27 Jan 2018

How about reads? I wonder if it is producing as many hard faults (like in the referenced issue above) during sync as the lmdb version..? All the random reads are causing glitches in audio among other things for some users, me among them. Nice work by the way!

jbe on 28 Jan 2018

@elliottneilclark that's weird, most people see a huge space savings with --vacuum, though maybe not so much on a fresh sync. How does the data.ldb compare to the backup file produced with vacuum? It's not on your list.

cryptocode on 28 Jan 2018

How does the data.ldb compare to the backup file produced with vacuum?

The copy is 2.54GB, not sure why that didn't show up the first time. Yeah this is a fresh sync so less fragmentation.

How about reads?

I haven't gotten reads hooked up yet. My plan is:

Hook up the random reads (all gets)
Hook up the iterators.
Get stats working.
Get tests passing
Clean up the store api.
Then look at the wallet

Right now since everything is hooked up double the IO is not any better. It will be a little bit until I have a good perf comparison.

elliottneilclark on 28 Jan 2018

👍3

I added on random reads, and pushed to my branch. Re-syncing to make sure that everything works. Then I'll look at iterators.

That should be the last bit before I can remove writing to lmdb and get a comparison.

elliottneilclark on 31 Jan 2018

@elliottneilclark Are you using TransactionDB or optimistic locking to replace lmdb tx?

cryptocode on 31 Jan 2018

I think it's worthwhile to look at anything that gives better performance. It's almost certain there are optimizations that can be done at a code level to solve some issues: we don't want to prematurely optimize.

That being said, your analysis of the issues tree-based structures have with high-entropy data is right on, and we should look at these alternatives and back them with real-world benchmarks.

I think the atomic property of lmdb is very beneficial; from a support standpoint I almost never hear of someone having a corrupt database that needs to be reinitialized. People kill processes and machines all the time, and if we lost this property we should definitely do a lot of thinking about whether we want to make that change.

Looks interesting so far though!

clemahieu on 1 Feb 2018

I think the atomic property of lmdb is very beneficial

Yeah, anything that we look at has to have atomic writes and not corrupt the database on crash or machine failure. That is just the minimum bar for any good data storage technology. RocksDB has atomic writes and shouldn't corrupt on anything short of a kernel-level bug or hardware failure.

Are you using TransactionDB or optimistic locking to replace lmdb tx?

Right now none at all. Though once I get the performance testing done I am planning to use optimistic locking for transactions. I'm holding off on doing that since it will be a huge change to the apis; I want to make sure the performance wins are there before looking at large changes like that.

elliottneilclark on 1 Feb 2018

I want to make sure the performance wins are there before looking at large changes like that.

Can you tell if there will be a benefit before adding transactions though? I assume optimistic locking isn't free.

Not sure how easy it is, but maybe use locking in a few localized spots to get some early measurements?

Very interesting to see if there's significant performance benefits in switching.

cryptocode on 1 Feb 2018

Can you tell if there will be a benefit before adding transactions though?

Transactions will make reasoning about visitor rollbacks easier. Other than that it will not be great.

I assume optimistic locking isn't free.

It's not; however, optimistic locking is very, very cheap. So while it will take a little bit more CPU, I doubt it will change anything from a performance perspective.

elliottneilclark on 2 Feb 2018

It's not; however, optimistic locking is very, very cheap.

As long as you don't hit highly contended hotspots, which probably isn't the case here.
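For readers unfamiliar with the idea, here is a minimal single-threaded sketch of optimistic (version-checked) transactions. It is a generic illustration, not RocksDB's implementation, and the names `VersionedStore`, `OptimisticTxn`, and `ConflictError` are invented for this example. Nothing is locked while a transaction runs; at commit time it checks that the keys it read have not changed, and fails if they have. That check is cheap when conflicts are rare and wasteful on contended hotspots, where transactions keep getting retried.

```python
class ConflictError(Exception):
    """Raised when a key read by the transaction changed before commit."""


class VersionedStore:
    def __init__(self):
        self.data = {}                      # key -> (version, value)

    def begin(self):
        return OptimisticTxn(self)


class OptimisticTxn:
    def __init__(self, store):
        self.store = store
        self.read_versions = {}             # key -> version seen at first read
        self.writes = {}                    # buffered until commit

    def get(self, key):
        if key in self.writes:
            return self.writes[key]
        version, value = self.store.data.get(key, (0, None))
        self.read_versions.setdefault(key, version)
        return value

    def put(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Validate: every key we read must still be at the version we saw.
        for key, seen in self.read_versions.items():
            current, _ = self.store.data.get(key, (0, None))
            if current != seen:
                raise ConflictError(key)
        # Apply buffered writes, bumping each key's version.
        for key, value in self.writes.items():
            version, _ = self.store.data.get(key, (0, None))
            self.store.data[key] = (version + 1, value)


store = VersionedStore()
first, second = store.begin(), store.begin()
first.put("balance", (first.get("balance") or 0) + 10)
second.put("balance", (second.get("balance") or 0) + 5)
first.commit()                              # succeeds
try:
    second.commit()                         # conflicts: "balance" changed
    conflicted = False
except ConflictError:
    conflicted = True
print(store.data["balance"], conflicted)
```

A real implementation would also have to make the validate-and-apply step atomic across threads, which this sketch does not attempt.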

cryptocode on 2 Feb 2018

Instabilities in nano have made me wary of making large changes. There's almost no chance that I can prove I haven't broken things with so many core parts of nano in flux. As such I'll close this and watch for some other time that it might be viable.

In the end I only got blocks working, but I was able to bootstrap with fewer than 100 IOPS, meaning I could bootstrap on a hard disk and still be fine. So there is a HUGE win there if someone else wants to do the work.

elliottneilclark on 1 Mar 2018

I’ll definitely be referencing this in the future, I’m going to take a look at HDD performance as soon as I can. It’s been an issue of time constraints, not lack of interest.

clemahieu on 1 Mar 2018

👍2

It would be a big win, I just can't finish it right now. I've lost time other places so hitting a deadline :-/

Things I would change from what I have in the last version if given more time:

Make settings configurable
Use one db with lots of different column families.
Remove the specialized iterators, and instead create one templated iterator that has a method for RocksDB result to type (union_256 etc).
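The last item, a single generic iterator parameterized by a decoding function, can be sketched as follows. This is in Python rather than the node's C++, so a generic function with decoder callables plays the role of the template; `typed_iterator` and `decode_uint256` are hypothetical names, and treating a 256-bit value as a big-endian integer is an assumption made for the example.

```python
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

K = TypeVar("K")
V = TypeVar("V")


def typed_iterator(
    raw: Iterable[Tuple[bytes, bytes]],
    decode_key: Callable[[bytes], K],
    decode_value: Callable[[bytes], V],
) -> Iterator[Tuple[K, V]]:
    # One iterator for every table: only the decoders differ.
    for raw_key, raw_value in raw:
        yield decode_key(raw_key), decode_value(raw_value)


def decode_uint256(data: bytes) -> int:
    # Assumed encoding: 32 bytes, big-endian.
    if len(data) != 32:
        raise ValueError("expected exactly 32 bytes")
    return int.from_bytes(data, "big")


# Raw (key, value) pairs as a storage backend might return them.
raw_rows = [((1).to_bytes(32, "big"), b"open"),
            ((2).to_bytes(32, "big"), b"send")]

decoded = list(typed_iterator(raw_rows, decode_uint256, bytes.decode))
print(decoded)
```

Adding a new table then means writing a decoder, not another iterator class.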

elliottneilclark on 1 Mar 2018
