Currently, one of the largest costs of running a RaiBlocks node is the large amount of IO needed just to keep up with the current write rate.
Spinning disks can do ~75 to 100 IOPS, or about 120 megabytes per second of sequential IO.
Consumer SSDs can do ~10K IOPS, or 375 megabytes per second of IO.
Bootstrapping currently requires ~1k IOPS and 3 megabytes/s. So LMDB is generating a lot of very small writes for every block, but it's not actually writing much data. A single spinning disk could easily sustain this write rate if the IOs were structured differently.
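As a rough back-of-the-envelope check of the figures above (a sketch only; the constant names are mine and the inputs are the approximate numbers quoted in this post), the problem is the number of operations, not the bandwidth:

```python
# Rough figures quoted above; all names here are illustrative.
HDD_IOPS = (75, 100)          # random IOPS range for a spinning disk
HDD_SEQ_MB_S = 120            # sequential throughput of a spinning disk
BOOTSTRAP_IOPS = 1000         # observed during bootstrapping
BOOTSTRAP_MB_S = 3            # observed during bootstrapping

# Average size of one write while bootstrapping, in kilobytes.
avg_write_kb = BOOTSTRAP_MB_S * 1000 / BOOTSTRAP_IOPS

# How far the workload exceeds a disk's best-case random-IO budget...
iops_overload = BOOTSTRAP_IOPS / HDD_IOPS[1]

# ...versus how little of its sequential bandwidth it would need.
bandwidth_share = BOOTSTRAP_MB_S / HDD_SEQ_MB_S

print(f"average write size: {avg_write_kb:.1f} kB")
print(f"IOPS needed vs best-case HDD: {iops_overload:.0f}x")
print(f"share of HDD sequential bandwidth: {bandwidth_share:.1%}")
```

So the workload asks for about ten times the random IO a disk can deliver while using only a few percent of its sequential bandwidth.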
LMDB is a memory-mapped B-Tree. It makes for some very, very fast random reads; however, it's expensive for writes.
That's not ideal for this use case, where we are more concerned with being able to sustain a large write rate. Access to the data is also heavily skewed by age: newer data is more likely to be read, while old data is less likely to be read. So we should choose a data-storage technology that allows for very cheap writes, has relatively cheap reads on recent data, and can scale to large amounts of data.
Log-structured merge trees (LSM trees), however, have exactly the properties that we're looking for. See: The advantages of an LSM vs a B-Tree
Log-structured merge trees allow writes to come in at a fantastic rate while generating only a small number of larger IOs. So we should think about replacing LMDB with a log-structured merge tree. The best in breed currently is RocksDB. It also has the added advantage that it can compress blocks.
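To make the idea concrete, here is a toy sketch of the LSM write path. It is not how RocksDB is implemented; `TinyLSM` and its methods are hypothetical names, and a "flush" here just appends a sorted list where a real store would write one large sequential file. It shows many small writes being absorbed in memory and turned into a few large sorted runs:

```python
import bisect


class TinyLSM:
    """Toy LSM tree: an in-memory memtable flushed as sorted, immutable runs."""

    def __init__(self, memtable_limit=4):
        self.memtable_limit = memtable_limit
        self.memtable = {}      # recent writes, absorbed in memory
        self.runs = []          # flushed runs, newest last; each is a sorted list
        self.flushes = 0        # stands in for the count of large sequential writes

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # One flush stands in for one large sequential write of a sorted file.
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.flushes += 1

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Search the newest run first so recent data is found soonest.
        for run in reversed(self.runs):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        # Merge all runs into one; values from newer runs win over older ones.
        merged = {}
        for run in self.runs:
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []


db = TinyLSM(memtable_limit=4)
for i in range(10):
    db.put(f"block-{i:02d}", i)
db.put("block-03", "updated")

print(db.flushes, db.get("block-03"), db.get("block-05"))
```

Reads of recent keys are served from the memtable or the newest run, which matches the access pattern described above; older keys cost more lookups until compaction merges the runs.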
Steps to reproduce the issue:
Environment:
cpu:
Intel(R) Core(TM) i5-4250U CPU @ 1.30GHz, 1995 MHz
Intel(R) Core(TM) i5-4250U CPU @ 1.30GHz, 1950 MHz
Intel(R) Core(TM) i5-4250U CPU @ 1.30GHz, 1733 MHz
Intel(R) Core(TM) i5-4250U CPU @ 1.30GHz, 2444 MHz
storage:
Intel 8 Series SATA Controller 1 [AHCI mode]
network:
eno1 Intel Ethernet Connection I218-V
network interface:
eno1 Ethernet network interface
lo Loopback network interface
docker0 Ethernet network interface
veth8964baa Ethernet network interface
disk:
/dev/sda Crucial_CT120M50
partition:
/dev/sda1 Partition
/dev/sda2 Partition
/dev/sda3 Partition
logs
01/24/2018 07:39:03 PM
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.40 0.00 777.60 0.00 3742.40 9.63 3.77 4.85 0.00 4.85 0.04 3.04
01/24/2018 07:39:08 PM
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.40 0.00 868.20 0.00 4256.80 9.81 4.18 4.82 0.00 4.82 0.04 3.44
01/24/2018 07:39:13 PM
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.40 0.00 696.60 0.00 3480.00 9.99 3.49 5.00 0.00 5.00 0.04 2.88
01/24/2018 07:39:18 PM
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.40 0.00 586.40 0.00 2907.20 9.92 3.03 5.17 0.00 5.17 0.04 2.48
01/24/2018 07:39:23 PM
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.60 0.00 490.60 0.00 2472.00 10.08 2.54 5.18 0.00 5.18 0.04 2.08
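For anyone who wants to check the samples above, a small sketch that parses the five `sda` rows and averages them (the column names are taken from the iostat header; the variable names are mine):

```python
# The five `sda` sample rows from the iostat output above.
SAMPLES = """\
sda 0.00 0.40 0.00 777.60 0.00 3742.40 9.63 3.77 4.85 0.00 4.85 0.04 3.04
sda 0.00 0.40 0.00 868.20 0.00 4256.80 9.81 4.18 4.82 0.00 4.82 0.04 3.44
sda 0.00 0.40 0.00 696.60 0.00 3480.00 9.99 3.49 5.00 0.00 5.00 0.04 2.88
sda 0.00 0.40 0.00 586.40 0.00 2907.20 9.92 3.03 5.17 0.00 5.17 0.04 2.48
sda 0.00 0.60 0.00 490.60 0.00 2472.00 10.08 2.54 5.18 0.00 5.18 0.04 2.08
"""
COLUMNS = ("rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz "
           "await r_await w_await svctm %util").split()

rows = []
for line in SAMPLES.splitlines():
    device, *values = line.split()
    rows.append(dict(zip(COLUMNS, map(float, values))))

# Average write operations per second and kilobytes written per second.
mean_writes_per_s = sum(r["w/s"] for r in rows) / len(rows)
mean_write_kb_per_s = sum(r["wkB/s"] for r in rows) / len(rows)

# Average payload of a single write request.
mean_kb_per_write = mean_write_kb_per_s / mean_writes_per_s

print(f"{mean_writes_per_s:.0f} writes/s, {mean_kb_per_write:.2f} kB per write")
```

The window works out to several hundred writes per second at roughly 5 kB each, with no reads at all, which is the "many tiny writes" pattern described in the opening post.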
LSMs are indeed great for writes. However, I think LMDB can work well with better batching, along the lines of https://github.com/clemahieu/raiblocks/pull/222. We could also look into environment sharding, as well as a no-sync option like the one some other cryptocurrencies using LMDB support.
As far as reliability goes, lmdb's author has a (biased) opinion: https://www.reddit.com/r/Monero/comments/4rdnrg/lmdb_vs_rocksdb/d51egcs/?st=jcucg6q0&sh=f4667a64
I personally think any future lmdb swapout should consider libmdbx (which is basically lmdb++)
I am very skeptical that LMDB will ever scale well with the current workload. Any B-Tree will struggle massively with the incoming write rate that's needed to keep up with #493. The keys being written all have lots and lots of entropy, so every time a new leaf is added to the tree the mutations will be spread out through the whole tree. Batching will not really help this at all. After the tree is large enough (meaning after bootstrapping), there will be almost no chance that two edits need to change the same parts of the tree. Hence batching will not gain much; it might combine two inserts into one transaction, but the many changes to the B-Tree will not be made any faster.
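A small simulation of that argument, under a deliberately simplified model that is my assumption: each high-entropy key lands on a uniformly random leaf page, and interior pages are ignored. The function names are illustrative. It counts how many distinct leaf pages a batch of inserts dirties as the tree grows:

```python
import random


def distinct_leaves_touched(num_leaves, batch_size, rng):
    """Count leaf pages hit when `batch_size` uniformly random keys are inserted.

    Keys with high entropy (e.g. hashes) land on an effectively random leaf,
    so each insert is modeled as picking a leaf uniformly at random.
    """
    return len({rng.randrange(num_leaves) for _ in range(batch_size)})


def expected_distinct_leaves(num_leaves, batch_size):
    """Closed form for the expectation: L * (1 - (1 - 1/L) ** B)."""
    return num_leaves * (1 - (1 - 1 / num_leaves) ** batch_size)


rng = random.Random(42)      # fixed seed so the run is repeatable
batch = 1000
for leaves in (100, 10_000, 1_000_000):
    touched = distinct_leaves_touched(leaves, batch, rng)
    expected = expected_distinct_leaves(leaves, batch)
    print(f"{leaves:>9} leaves: {touched} pages dirtied by {batch} inserts "
          f"(expected {expected:.0f})")
```

With a small tree a batch collapses onto a handful of pages, but once the tree has far more leaves than the batch has inserts, nearly every insert dirties its own page, so batching saves transactions but not page writes.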
As for the reliability claims, I'll just say I strongly disagree with the lmdb author, and leave it at that.
If you feel strongly that you want to go with LMDB (or something derived from it), I'll be very interested in seeing what works and what doesn't.
If you feel strongly that you want to go with LMDB
@elliottneilclark For the record, I'm not in the position to decide anything :) I'm just another contributor chiming in. I'm fairly confident that none of the KV stores available is a panacea. LMDB is rock solid and extremely portable, and that matters though.
Hence batching will not gain much;
I think benchmarking will have to decide that - the db is clearly not used optimally at the moment. And then there's the topic of sharding the environment. At the very least splitting the wallets and the ledger.
Trying RocksDB would be interesting IMO, but it would have to be a lot more robust than LevelDB ever was.
Considering the backing it has, I'm sure it is though ;)
Perhaps the ideal is to introduce a KV store layer so backends can be plugged in easily and benchmarked properly. Users/apps/exchanges could then pick whatever suits them best. A lot of work though.
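A minimal sketch of what such a layer could look like. Everything here is hypothetical (`KVStore`, `MemoryStore`, `save_send_block` and the key layout are made up for illustration and are not the node's actual API); the table names are borrowed from the directory listing later in this thread:

```python
from abc import ABC, abstractmethod


class KVStore(ABC):
    """Hypothetical backend-neutral interface; not the node's actual API."""

    @abstractmethod
    def get(self, table, key):
        """Return the value stored under (table, key), or None."""

    @abstractmethod
    def write_batch(self, puts):
        """Apply an iterable of (table, key, value) puts as one atomic unit."""


class MemoryStore(KVStore):
    """Reference backend for tests; an LMDB or RocksDB adapter would mirror it."""

    def __init__(self):
        self._data = {}

    def get(self, table, key):
        return self._data.get((table, key))

    def write_batch(self, puts):
        # Stage everything first so a bad entry leaves the store untouched.
        staged = {}
        for table, key, value in puts:
            if not isinstance(value, bytes):
                raise TypeError("values must be bytes")
            staged[(table, key)] = value
        self._data.update(staged)


def save_send_block(store, block_hash, block_bytes, account):
    # Caller code depends only on KVStore, so backends can be swapped
    # and benchmarked against each other. The key layout is an assumption.
    store.write_batch([
        ("send_blocks", block_hash, block_bytes),
        ("frontiers", block_hash, account),
    ])


store = MemoryStore()
save_send_block(store, b"\x01" * 32, b"block-body", b"\x02" * 32)
print(store.get("send_blocks", b"\x01" * 32))
```

Keeping the interface down to point reads and atomic batches keeps adapters small, at the cost of hiding backend-specific features such as cursors.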
Trying RocksDB would be interesting IMO, but it would have to be a lot more robust than LevelDB ever was.
RocksDB is very solid.
Perhaps the ideal is to introduce a KV store layer so backends can be plugged in easily and benchmarked properly
That was my initial thought, but then I started looking at it and lmdb transactions have pervaded the apis. So it's probably more work than I can take on to make it fully pluggable. However I am hoping to try a quick proof of concept for rocksdb.
@Elliott Just do it, let it happen. Obviously raiblocks needs you guys to make it more perfect.
https://github.com/elliottneilclark/raiblocks/tree/rocksdb
My CMake is pretty rusty, but it compiles. So we're on the way.
@elliottneilclark Compiling is a great start ;)
Not sure about choosing zstd over lz4 though. While zstd yields better ratios, it is 3x slower at decompression and almost twice as slow at compression (LZ4 fast 8 actually pulls off 900 MB/s).
A pruned node is probably not going to have much redundancy btw, so maybe make it optional.
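To illustrate the redundancy point, a quick sketch that uses Python's built-in `zlib` purely as a stand-in (it is neither zstd nor lz4, and the payloads are synthetic assumptions, not real ledger data). It compares how well hash-like bytes and repetitive bytes compress:

```python
import random
import zlib

rng = random.Random(0)   # fixed seed so the payload is repeatable

# Hash-like payload: 64 KiB of pseudo-random bytes, i.e. almost no redundancy.
high_entropy = bytes(rng.getrandbits(8) for _ in range(64 * 1024))

# Redundant payload: the same 32-byte pattern repeated to the same total size.
redundant = bytes(range(32)) * (64 * 1024 // 32)


def ratio(data):
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data)) / len(data)


print(f"high-entropy ratio: {ratio(high_entropy):.3f}")
print(f"redundant ratio:    {ratio(redundant):.3f}")
```

If most of what a pruned node stores looks like the first payload, compression would cost CPU for little gain, which is an argument for making it configurable.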
Yeah ZSTD is slower; choosing between LZ4 and ZSTD will be one of the things we'll have to test later. I'm going with my usual default and then we can see if we need more compression/decompression speed.
Sounds great
The current tip of the rocksdb branch on my GitHub writes to both stores. I'm currently resyncing to see how that goes.
I ran a node last night that was double writing to RocksDB and to lmdb. This morning after everything was synced up this is what I saw for size:
65M /home/eclark/RaiBlocks/accounts
18M /home/eclark/RaiBlocks/blocks_info
18M /home/eclark/RaiBlocks/change_blocks
22M /home/eclark/RaiBlocks/checksum
4.0K /home/eclark/RaiBlocks/config.json
3.0G /home/eclark/RaiBlocks/data.ldb
4.0K /home/eclark/RaiBlocks/data.ldb-lock
64M /home/eclark/RaiBlocks/frontiers
5.9M /home/eclark/RaiBlocks/log
18M /home/eclark/RaiBlocks/meta
69M /home/eclark/RaiBlocks/open_blocks
76M /home/eclark/RaiBlocks/pending
325M /home/eclark/RaiBlocks/receive_blocks
40M /home/eclark/RaiBlocks/representation
397M /home/eclark/RaiBlocks/send_blocks
129M /home/eclark/RaiBlocks/unchecked
152K /home/eclark/RaiBlocks/unsynced
40M /home/eclark/RaiBlocks/vote
4.2G total
LMDB uses about 3 GB of space, while RocksDB uses 1.2 GB: a 60% space savings.
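The arithmetic behind that figure, as a sketch. It assumes everything in the directory other than data.ldb belongs to RocksDB, which slightly overcounts because config.json and the log directory are included:

```python
TOTAL_GB = 4.2      # `du` total for the data directory
LMDB_GB = 3.0       # data.ldb

# Everything else in the directory is treated as RocksDB data (an assumption).
rocksdb_gb = TOTAL_GB - LMDB_GB
savings = 1 - rocksdb_gb / LMDB_GB

print(f"RocksDB: {rocksdb_gb:.1f} GB, savings vs LMDB: {savings:.0%}")
```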
@elliottneilclark that seems about right, --vacuum tends to shave off 60% as well
The size is the same before and after --vacuum. So the 60% savings is more likely due to compression
65M /home/eclark/RaiBlocks/accounts
18M /home/eclark/RaiBlocks/blocks_info
18M /home/eclark/RaiBlocks/change_blocks
22M /home/eclark/RaiBlocks/checksum
4.0K /home/eclark/RaiBlocks/config.json
3.0G /home/eclark/RaiBlocks/data.ldb
4.0K /home/eclark/RaiBlocks/data.ldb-lock
64M /home/eclark/RaiBlocks/frontiers
5.9M /home/eclark/RaiBlocks/log
18M /home/eclark/RaiBlocks/meta
69M /home/eclark/RaiBlocks/open_blocks
76M /home/eclark/RaiBlocks/pending
325M /home/eclark/RaiBlocks/receive_blocks
40M /home/eclark/RaiBlocks/representation
397M /home/eclark/RaiBlocks/send_blocks
129M /home/eclark/RaiBlocks/unchecked
152K /home/eclark/RaiBlocks/unsynced
40M /home/eclark/RaiBlocks/vote
4.2G total
How about reads? I wonder if it is producing as many hard faults (like in the referenced issue above) during sync as the lmdb version..? All the random reads are causing glitches in audio among other things for some users, me among them. Nice work by the way!
@elliottneilclark that's weird, most people see a huge space savings with --vacuum, though maybe not so much on a fresh sync. How does the data.ldb compare to the backup file produced with vacuum? It's not on your list.
How does the data.ldb compare to the backup file produced with vacuum?
The copy is 2.54GB, not sure why that didn't show up the first time. Yeah this is a fresh sync so less fragmentation.
How about reads?
I haven't gotten reads hooked up yet. My plan is:
Right now, since every write goes to both stores, the IO is not any better. It will be a little while until I have a good perf comparison.
I added random reads and pushed to my branch. I'm re-syncing to make sure that everything works. Then I'll look at iterators.
That should be the last bit before I can remove writing to lmdb and get a comparison.
@elliottneilclark Are you using TransactionDB or optimistic locking to replace lmdb tx?
I think it's worthwhile to look at anything that gives better performance. It's almost certain there are optimizations that can be done at a code level to solve some issues: we don't want to prematurely optimize.
That being said your analysis of the tree based structure issues with high entropy data is right on and we should look at these alternatives and back them with real world benchmarks.
I think the atomic property of lmdb is very beneficial. From a support standpoint, I almost never hear of someone having a corrupt database that needs to be reinitialized. People kill processes and machines all the time, and if we were going to lose this property we should definitely do a lot of thinking about whether we want to make that change.
Looks interesting so far though!
I think the atomic property of lmdb is very beneficial
Yeah, anything that we look at has to have atomic writes and not corrupt the database on crash or machine failure. That is just the minimum bar for any good data storage technology. RocksDB has atomic writes and shouldn't corrupt on anything short of a kernel-level bug or hardware failure.
Are you using TransactionDB or optimistic locking to replace lmdb tx?
Right now none at all. Though once I get the performance testing done, I am planning to use optimistic locking for transactions. I'm holding off on doing that since it will be a huge change to the APIs; I want to make sure the performance wins are there before looking at large changes like that.
I want to make sure the performance wins are there before looking at large changes like that.
Can you tell if there will be a benefit before adding transactions though? I assume optimistic locking isn't free.
Not sure how easy it is, but maybe use locking in a few localized spots to get some early measurements?
It will be very interesting to see whether there are significant performance benefits in switching.
Can you tell if there will be a benefit before adding transactions though?
Transactions will make reasoning about visitor rollbacks easier. Other than that it will not be great.
I assume optimistic locking isn't free.
It's not; however, optimistic locking is very, very cheap. So while it will use a little more CPU, I doubt it will change anything from a performance perspective.
It's not; however, optimistic locking is very, very cheap.
As long as you don't hit highly contended hotspots, which probably isn't the case here.
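For readers unfamiliar with the scheme being discussed, a toy single-threaded sketch of optimistic concurrency control. This is not RocksDB's API; `VersionedStore`, `OptimisticTxn` and `ConflictError` are hypothetical names, and a real store would make the validate-and-apply step atomic. No locks are taken while a transaction runs, and conflicts are detected only at commit:

```python
class ConflictError(Exception):
    """Raised when a key used by the transaction changed before commit."""


class VersionedStore:
    def __init__(self):
        self._data = {}     # key -> (version, value)

    def read(self, key):
        return self._data.get(key, (0, None))

    def begin(self):
        return OptimisticTxn(self)


class OptimisticTxn:
    def __init__(self, store):
        self._store = store
        self._seen_versions = {}    # key -> version seen at first access
        self._writes = {}

    def get(self, key):
        if key in self._writes:
            return self._writes[key]
        version, value = self._store.read(key)
        self._seen_versions.setdefault(key, version)
        return value

    def put(self, key, value):
        # Record the version being overwritten so blind writes are validated too.
        self._seen_versions.setdefault(key, self._store.read(key)[0])
        self._writes[key] = value

    def commit(self):
        # Validation happens only here, which is why the scheme is cheap
        # when conflicts are rare and wasteful when keys are hotly contended.
        for key, seen in self._seen_versions.items():
            if self._store.read(key)[0] != seen:
                raise ConflictError(key)
        for key, value in self._writes.items():
            version = self._store.read(key)[0]
            self._store._data[key] = (version + 1, value)


store = VersionedStore()
t1, t2 = store.begin(), store.begin()
t1.put("frontier", "A")
t2.put("frontier", "B")
t1.commit()
try:
    t2.commit()
    conflicted = False
except ConflictError:
    conflicted = True

print(conflicted, store.read("frontier"))
```

The second transaction has to be retried by the caller, which is the cost that shows up under contention.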
Instabilities in nano have made me wary of making large changes. There's almost no chance that I can prove I haven't broken things with so many core parts of nano in flux. As such I'll close this and watch for some other time that it might be viable.
In the end I only got blocks working, but I was able to bootstrap with fewer than 100 IOPS, meaning I could bootstrap on a hard disk and still be fine. So there is a HUGE win there if someone else wants to do the work.
I’ll definitely be referencing this in the future, I’m going to take a look at HDD performance as soon as I can. It’s been an issue of time constraints, not lack of interest.
It would be a big win; I just can't finish it right now. I've lost time elsewhere, so I'm up against a deadline :-/
Things I would change from what I have in the last version if given more time: