This is linked to meta-issue #7096.
This issue discusses potential approaches to the ongoing maintenance or replacement of IAVL.
It is our understanding that IAVL has effectively become an orphaned project within the Cosmos ecosystem. I don't have a full picture of the issues related to IAVL, and welcome those with more context to share it. /cc @ethanfrey @erikgrinaker @alexanderbez
A complete overhaul has been proposed as one solution. I don't fully have the context for why this is needed, and would appreciate those who do sharing the details here.
Replacing IAVL with a more mature alternative has been proposed as well. I'll just list out the alternatives that have been mentioned here:
I think there are two separate issues:
Efficient storage -- this can be done with RocksDB or something similar. Libra, for example, is using RocksDB for their storage module - in fact it's a wrapper on top of RocksDB that deals with the serialization of keys and values (we need to store multiple types of data, and key-value pairs in RocksDB are byte arrays). This wrapper enforces that all data in and out of the DB is structured according to predefined schemas (a sketch of such a wrapper follows the list below).
Storage proofs / storage commitments. To prove a state transition we need to be able to prove transaction inputs and outputs. This is usually done by committing to the whole state using a Merkle root.
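To make the wrapper idea in the first point concrete, here is a minimal sketch of a schema-enforcing layer over a raw byte-oriented store. All names here (KVStore, TypedStore, the JSON stand-in codec) are illustrative assumptions, not Libra's actual API:

```go
package store

import (
	"encoding/json"
	"fmt"
)

// KVStore is the raw byte-oriented interface a backend like RocksDB exposes.
type KVStore interface {
	Get(key []byte) ([]byte, error)
	Set(key, value []byte) error
}

// TypedStore enforces that everything written under one prefix round-trips
// through a single, predefined serialization schema for type T.
type TypedStore[T any] struct {
	db     KVStore
	prefix []byte // namespace for this data type, e.g. []byte("acc/")
}

func (s TypedStore[T]) key(k []byte) []byte {
	out := make([]byte, 0, len(s.prefix)+len(k))
	return append(append(out, s.prefix...), k...)
}

// Put serializes v according to the schema before it ever reaches the DB.
func (s TypedStore[T]) Put(k []byte, v T) error {
	bz, err := json.Marshal(v) // stand-in codec; Libra uses its own serialization
	if err != nil {
		return fmt.Errorf("schema violation: %w", err)
	}
	return s.db.Set(s.key(k), bz)
}

// Get deserializes back into the schema's type, so callers never see raw bytes.
func (s TypedStore[T]) Get(k []byte) (T, error) {
	var v T
	bz, err := s.db.Get(s.key(k))
	if err != nil {
		return v, err
	}
	return v, json.Unmarshal(bz, &v)
}
```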
Depending on the use-case it would be good to have a closer look at alternatives to Merkle trees for storage commitments. I will review the following papers and write a summary next week.
Pointproofs: Aggregating Proofs for Multiple Vector Commitments. It provides non-interactive aggregation of proofs across multiple commitments. This construction enables any third party to aggregate a collection of proofs with respect to different, independently computed commitments into a single proof represented by a 48-byte elliptic curve point.
I believe this could be very useful for light clients and all sorts of batch processes.
Merkle Mountain Range -- Merkle tree optimized for lists of sequentially appended data and its proofs.
I think we're explicitly avoiding RocksDB for various reasons. I do agree we should explore alternatives in logical storage. MMR, sparse Merkle trees, and urkel trees all look like good contenders. One thing to keep in mind is that we need:
IAVL is very problematic in its I/O patterns. After Amino, it's probably the largest Cosmos SDK performance bottleneck. It also uses a lot of indirection, which makes lookups very inefficient.
@alexanderbez thanks for the comment. Is there any resource summarizing your decision about RocksDB?
Also, what's the use case for doing prefix iteration?
Prefix iteration is used extensively throughout the SDK. It's how we do index scans.
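For readers unfamiliar with the pattern, here is a rough sketch of a prefix scan over an ordered KV store. The interface shapes mirror what the SDK's stores offer, but the names here are illustrative, not the actual SDK API:

```go
package main

type Iterator interface {
	Valid() bool
	Next()
	Key() []byte
	Value() []byte
	Close() error
}

type OrderedKV interface {
	// Iterator iterates over the half-open range [start, end)
	// in lexicographic (byte-wise) key order.
	Iterator(start, end []byte) Iterator
}

// prefixEnd returns the smallest key strictly greater than every key that
// starts with prefix (increment the last non-0xff byte and truncate).
func prefixEnd(prefix []byte) []byte {
	end := append([]byte(nil), prefix...)
	for i := len(end) - 1; i >= 0; i-- {
		if end[i] != 0xff {
			end[i]++
			return end[:i+1]
		}
	}
	return nil // all 0xff: scan to the end of the keyspace
}

// scanPrefix visits every entry whose key starts with prefix, e.g. every
// record stored under []byte("validators/") -- this is an index scan.
func scanPrefix(db OrderedKV, prefix []byte, visit func(k, v []byte)) {
	it := db.Iterator(prefix, prefixEnd(prefix))
	defer it.Close()
	for ; it.Valid(); it.Next() {
		visit(it.Key(), it.Value())
	}
}
```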
@marbar3778 mentioned the other day that our aim to be database-agnostic is also placing an additional burden on state tree update optimisation. We may want to consider limiting database support in conjunction with this issue, to leverage database-native efficiencies.
cc @musalbas if you have thoughts on use of jellyfish here
I think we're explicitly avoiding RocksDB for various reasons.
rocksdb is supported. I would not say we are avoiding it because we have done next to zero database configuration testing. It is still unclear if goleveldb is better for the sdk than others (badgerdb, boltdb, rocksdb). There are a couple users who do use rocksdb.
The Cosmos-SDK, IAVL, and Tendermint have support for 5 databases. Since the Cosmos-SDK and hopefully Tendermint are considered frameworks for building blockchains, an agnostic approach to supported databases was taken. This is nice, but causes some headaches in optimizations. We can provide sane defaults for various DBs, but this requires more testing, of which we currently have none.
I do agree with:
I think there at two separate issues:
Efficient storage should be a separate issue and have its own discussion. I opened an issue (https://github.com/cosmos/cosmos-sdk/issues/6406) to discuss supporting fewer dbs based on testing that needs to be done. This can be a start.
Ohh, I never said it wasn't supported - just that, from my experience, there are much better alternatives (e.g. LevelDB, Badger, etc.).
One thing to keep in mind as well is that there are libraries out there that are very well supported, tested, and benchmarked. Continuing to maintain IAVL, especially when it's sub-optimal at best, is probably not the best strategy.
It would be great if someone would take the lead on exploring alternatives.
Generally, it seems that the blockchain space is heading towards Sparse Merkle Trees (SMTs) for state trees - including Ethereum 2.0 (although it now seems they're leaning towards polynomial commitments), Libra (jellyfish), LazyLedger, and also Google's Trillian. However, all of these projects use slightly different designs for the tree, so these implementations are not necessarily interchangeable. Libra and LazyLedger use a compute-optimised SMT that replaces subtrees containing only default values with a single node; Ethereum 2.0 plans to use a compute-optimised tree by using a more efficient hash function with a standard SMT; and Trillian uses a standard unoptimised SMT (afaik).
At LazyLedger we have settled on using the Libra SMT, and implemented it in Golang. Libra has an implementation in Rust called jellyfish, but it is slightly different as it operates on "nibbles" (where each key is split into 4-bit sections). In LazyLedger's SMT implementation the storage code is abstracted away as a "MapStore" interface, so devs can define their own storage backend.
I'm not sure about the implementation complexity of IAVL trees, but I think SMTs are relatively simple and shouldn't be difficult to maintain. The core of the LazyLedger SMT implementation is 400 lines of code. We do store each node in the tree in the DB, though, so I'm not sure if that addresses the I/O performance concerns.
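For reference, a storage backend then only has to satisfy a tiny interface along these lines - Get/Set/Delete on raw bytes (this matches my reading of the lazyledger/smt MapStore; check the repo for the exact signatures). A minimal in-memory version:

```go
package main

import "fmt"

// MapStore is the shape of the storage interface the SMT abstracts over.
type MapStore interface {
	Get(key []byte) ([]byte, error)
	Set(key []byte, value []byte) error
	Delete(key []byte) error
}

// InMemoryStore is the simplest possible backend: a Go map.
type InMemoryStore struct {
	m map[string][]byte
}

func NewInMemoryStore() *InMemoryStore {
	return &InMemoryStore{m: make(map[string][]byte)}
}

func (s *InMemoryStore) Get(key []byte) ([]byte, error) {
	if v, ok := s.m[string(key)]; ok {
		return v, nil
	}
	return nil, fmt.Errorf("key %x not found", key)
}

func (s *InMemoryStore) Set(key []byte, value []byte) error {
	s.m[string(key)] = value
	return nil
}

func (s *InMemoryStore) Delete(key []byte) error {
	delete(s.m, string(key))
	return nil
}
```

Swapping in a LevelDB- or RocksDB-backed implementation changes nothing in the tree code itself.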
I think Ethereum 2.0's big bet is on polynomial commitments.
@musalbas - why did you settle on jellyfish?
We're not using jellyfish, we're using our own SMT library that adopts the same SMT design as specified in the Libra whitepaper. We chose this as it's the simplest tree design we're aware of with an O(log(k)) get/update cost. Vitalik's scheme using an optimised hash function is simpler, but there were some concerns around its security, and it still requires a hash operation every 16 levels in the tree.
Does Jellyfish or LL's compute-optimized SMT allow for prefix iteration/range queries? If so, that seems like a solid bet.
No I don't think so, because the keys are hashed, and so they should appear in random places in the tree. A range query would thus simply consist of multiple Merkle proofs.
Prefix iteration can be done separately. It doesn't need to be done on the storage commitment unless you need proofs / commitments while doing the iteration. Range queries can be implemented with prefix iteration.
Where are the range proofs used? What's the use-case for them? Maybe this can be solved with aggregatable commitments (like the aforementioned pointproofs or RSA accumulators).
@AdityaSripal do we have/need range proofs in the SDK? I do know we need to absolutely support range/prefix queries. However, I don't know what you mean by "A range query would thus simply consist of multiple Merkle proofs."
@musalbas? Can we avoid hashing the keys before insertion?
@alexanderbez, @musalbas - breaking a range query into single queries for Merkle proofs (assuming this is needed) sounds very inefficient.
@AdityaSripal do we have/need range proofs in the SDK? I do know we need to absolutely support range/prefix queries. However, I don't know what you mean by
A range query would thus simply consist of multiple Merkle proofs.
@musalbas? Can we avoid hashing the keys before insertion?
Hashing keys is necessary for a compute-optimised SMT, because if an attacker can choose which part of the tree a new key can be added to, they can cause Merkle proofs for certain keys to be very long (by inserting a key directly next to another key). Hashing prevents this by randomising the location that keys are in the tree.
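Concretely, the randomization step is just one hash before every write; a sketch, where `tree` stands for any SMT-like structure with a Set method (illustrative, not a specific library):

```go
package main

import "crypto/sha256"

type smt interface {
	Set(path, value []byte)
}

// setHashed inserts under the hash of the key, so the leaf's position in the
// 256-bit keyspace is uniform and an attacker can't place keys next to each
// other to lengthen someone else's Merkle proofs.
func setHashed(tree smt, key, value []byte) {
	path := sha256.Sum256(key)
	tree.Set(path[:], value)
}
```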
If you want to prove membership of multiple keys in the tree, that would require an individual Merkle proof per key, regardless if they are within the same "range".
It would be helpful to know where range proofs are being used for the state tree though. I understand why range proofs may be helpful for the Merkle tree of transactions, but not sure why they would be needed for the state tree.
Hashing keys is necessary for a compute-optimised SMT, because if an attacker can choose which part of the tree a new key can be added to, they can cause Merkle proofs for certain keys to be very long (by inserting a key directly next to another key). Hashing prevents this by randomising the location that keys are in the tree.
I see. I guess this isn't a problem with IAVL because it's self-balancing...
Since SMTs hash keys prior to insertion, like Merkle-Patricia trees do, I don't see how we'd have prefix iteration unless we can figure out how to leverage the underlying physical storage somehow :(
@AdityaSripal do we have/need range proofs in the SDK? I do know we need to absolutely support range/prefix queries.
At present, we do not support range proofs (of more than 1 key) in the SDK, since the subspace queries do not support returning proofs.
https://github.com/cosmos/cosmos-sdk/blob/master/store/iavl/store.go#L271
I brought it up in an issue a while ago here: https://github.com/cosmos/cosmos-sdk/issues/5241
And I tried to do some work on it, but I couldn't get it to work and there were other, more pressing blockers for IBC proofs, so I dropped it and worked on those.
Hashing keys is necessary for a compute-optimised SMT, because if an attacker can choose which part of the tree a new key can be added to, they can cause Merkle proofs for certain keys to be very long (by inserting a key directly next to another key). Hashing prevents this by randomising the location that keys are in the tree.
I see. I guess this isn't a problem with IAVL because it's self-balancing...
Since SMTs hash keys prior to insertion, like Merkle-Patricia trees do, I don't see how we'd have prefix iteration unless we can figure out how to leverage the underlying physical storage somehow :(
I believe if you use the optimised hash function optimisation for SMTs instead of the Libra optimisation, you can have free choice over where keys are stored in the tree, and can thus support range proofs. cc @adlerjohn
@erikgrinaker is on vacation, but when he's back it would be good to understand more about the current status of IAVL and what's needed to ensure the correctness of that implementation.
Just adding another option here:
https://github.com/nomic-io/merk by @mappum
It uses RocksDB, supports checkpointing, is still an AVL tree so supports range queries, and has the pieces to support state syncing.
@erikgrinaker is on vacation, but when he's back it would be good to understand more about the current status of IAVL and what's needed to ensure the correctness of that implementation.
As far as I know, IAVL is correct - it's just slow and inefficient, and not concurrency-safe. I think this is in part because it tries to do too much itself (e.g. versioning) which should ideally be offloaded to the underlying database engine. The access patterns are also suboptimal, requiring lots of random accesses - I'm not familiar with the state of the art in Merkle trees, but there's a reason databases use e.g. B+trees instead of binary trees.
Building a performant data store is non-trivial, and not something we should spend engineering resources on if we can avoid it in my opinion. I'd be happy to help explore alternatives if necessary.
@erikgrinaker the advantage of B+trees is that, in general, you get faster lookups (they are more compact). This is especially important for storage layers that are not very fast at random access, or where you can't efficiently use hardware caches.
Exactly. I suspect an AVL tree is inherently inefficient given block devices and CPU caches, but I don't know how Merkelization affects these tradeoffs.
In Merk, the Merkle tree structure doesn't affect the lookups (every node is indexed by its key rather than its hash), so there is practically no read overhead compared to plain RocksDB.
Merk looks very promising! Do you know how it compares to IAVL? I'm sure it's much better, but I just want to get a feel for raw numbers. Also, we'd probably need to port this to Go if there isn't already a version out there.
Do you know how it compares to IAVL?
15x - 40x higher write throughput last time I compared on my MacBook (the lower end with none of the tree kept in memory, the higher end with all of it kept in memory).
Also, we'd need to probably port this to Go if there isn't already a version out there.
Someone did port it to Go: https://github.com/tak1827/merk-go, although I'm not sure what the status of that is, plus it might be easier to just make bindings for the Rust lib rather than keep the 2 in sync.
Yeah, I'd lean towards bindings. I'm also fantasizing about using wasm for cross-language libraries, but don't know if that's viable or performant yet.
It seems that the library is not very big.
Also, I'm preparing a general overview and a spec doc to put together all requirements and find the best solution (maybe we can do better than a Merkle tree).
I'm still missing a piece about range proofs - where and why do we need them?
I see a need for bulk proofs / aggregated commitments -- for light client use cases and stateless clients.
Aren't range proofs a prerequisite for absence proofs, since these are really just range proofs over the neighbors of the absent key?
Aren't range proofs a prerequisite for absence proofs, since these are really just range proofs over the neighbors of the absent key?
Depending on the definition and implementation. For me, a range proof is more an aggregated commitment, or a proof over a range of values.
An absence proof is a proof that a specific value is not part of the state. With Merkle trees, an absence proof is implemented by providing proofs of two neighbors.
I'm still missing a piece about range proofs - where and why do we need it?
It makes proofs much more efficient (in terms of proof size, and CPU/IO for the node resolving the proof) when using data structures more advanced than just simple account maps. For example, you can split a large struct into multiple keys, and still get a proof of the fields you need without needing N completely separate Merkle branches.
Although that's just one optimization, the important part is that if you've implemented the tree to support range proofs you can also likely iterate by key in the on-chain logic (very significant for on-chain orderbooks, for example).
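As a sketch of the struct-splitting idea (the key layout here is hypothetical): give every field of an object a key under a shared prefix, so all of them sit in one contiguous key range and a single range proof can cover any subset of the fields:

```go
package main

// fieldKey builds "acct/<addr>/<field>". All fields of one account share the
// "acct/<addr>/" prefix, so they are adjacent in the keyspace and one range
// proof covers e.g. balance + sequence without two separate Merkle branches.
func fieldKey(addr []byte, field string) []byte {
	k := make([]byte, 0, len("acct/")+len(addr)+1+len(field))
	k = append(k, "acct/"...)
	k = append(k, addr...)
	k = append(k, '/')
	k = append(k, field...)
	return k
}
```

This of course relies on the tree preserving key order (i.e. not hashing keys), which is exactly the trade-off discussed below.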
Not necessarily - it depends on the tree design. In SMTs, non-membership proofs don't require range proofs, just a proof to show that the longest prefix of the key leads to a leaf with an empty value.
It makes proofs much more efficient (in terms of proof size, and CPU/IO for the node resolving the proof) when using data structures more advanced than just simple account maps. For example, you can split a large struct into multiple keys, and still get a proof of the fields you need without needing N completely separate Merkle branches.
I've got that. But I don't see the use case where this will apply. Since the accounts are randomly distributed, practically speaking no set of proofs will represent a range. Instead, as I wrote above, I see the big use case being aggregated proofs.
Not sure why there needs to be an explicit distinction. In Merk any selection of keys can be included in a query, so it can be any combination of ranges or individual keys. https://github.com/nomic-io/merk/blob/develop/docs/algorithms.md#structure
But a concrete example for ranges we are working with is an order book, where we need to access the best orders (sorted by price). We store these orders with keys being the encoding of (price, address). Getting a proof of e.g. the 100 highest bids means getting a single range which will only have around log(n) hashes, whereas an aggregated proof of 100 completely random keys would be around 100 * log(n) hashes.
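A sketch of that key encoding (field widths illustrative): the price goes first, big-endian and fixed-width, so byte-wise key order equals numeric price order and one price band is a single contiguous range:

```go
package main

import "encoding/binary"

// orderKey encodes (price, address). Big-endian fixed-width price makes
// lexicographic comparison agree with numeric comparison; the address suffix
// just breaks ties between orders at the same price.
func orderKey(price uint64, addr []byte) []byte {
	key := make([]byte, 8, 8+len(addr))
	binary.BigEndian.PutUint64(key, price)
	return append(key, addr...)
}
```

Fetching the 100 highest bids is then a single reverse scan from the top of the range.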
iterate by key in the on-chain logic (very significant for on-chain orderbooks, for example).
Many of the Cosmos SDK core modules make extensive use of this feature (basically anywhere you see an iterator, such as in governance or staking).
And here's an example of an orderbook like the one @mappum suggests (https://github.com/sunnya97/sdk-dex-mvp/blob/master/x/orderbook/orderwalls.go).
@blinky3713 might want to follow along here since FOAM had some specific needs from IAVL
You can generically add key iterability to the leaf data structure, independently of the tree format for handling the relationship between leaves.
In all the usecases of key iterability I know of in the SDK, we know at app-writing time that a particular leaf wants key iterability.
Thanks @hxrts. I don't really care what happens as long as someone maintains or ports the grpc interface with protobuf files that I worked really hard to revive and get merged. If that goes away I'm kind of at a loss for words as to how disappointed I will be, after (1) writing this argument up and having multiple people from tendermint/cosmos sign off on it (at least implicitly) and (2) implementing it myself in this PR after multiple rounds of feedback, considering that I almost never write go and it was an ordeal for me.
If you are reading this and are unfamiliar with kepler, the Haskell-language version of the cosmos sdk, see here.
I was checking the SDK, and I couldn't find any usage of range proofs. The Order book example uses a storage iterator, not a range proof. And the storage iterator can be implemented on the storage layer, not on the state commitments layer.
We don't construct computation proofs, so the state transition is only validated through block acceptance: re-computation of all transactions in the block and a state commitment check.
Range proofs, or partial / selective commitments, could be very useful for stateless clients though.
I've finalized a report reviewing potential solutions for storage and a new commitment schema (today I added conclusions). I've put it on Dropbox Paper - it's much easier to read and comment there than on a GitHub PR (Markdown in diff format is not very friendly). Later I can move it to GitHub if needed.
Main goals:
THE REPORT has 3 sections: the state commitments review, storage review and conclusions (personal recommendation).
Would love to hear your feedback.
@mappum - I went through the Merk description (didn't go into the code). It looks interesting, however a few things are not clear to me. I added them as TODOs in the review I linked above. It would be great to finalize them:
- Could you assess whether Merk is using less storage than SMT? To me it's the opposite: an optimized SMT stores only the required nodes and can compress keys (we don't store the same prefixes twice), whereas Merk is storing whole keys all the time.
I would assume the key overhead means Merk is using more storage, but IMO for the systems we're building, storage is low on the resource priority list. We can still achieve similar key compression for encoding proofs so there is not a higher bandwidth cost for queries/state sync.
- Is there any benchmark comparing Merk and an SMT-based commitment?
Not that I know of, what operations are you interested in comparing?
For the tree storage - would be great to keep it small so it can easily fit in RAM.
Re benchmarks: most of the state commitment operations are related to adding and updating state elements.
I think Merk will likely be faster (at least for read operations, not sure about updates) because it makes implementation optimisations that various SMT implementations don't. However, I think the optimisations that Merk makes are behind-the-scenes implementation optimisations that could be made for other trees like SMTs too, without changing the underlying data structure - for example, storing nodes by key rather than by hash, to reduce the number of database accesses required to get a key from up to 256 to 1. This doesn't seem to require changing the actual data structure of the tree, just how the tree is stored locally.
For the tree storage - would be great to keep it small so it can easily fit in RAM.
For the in-memory tree representation, Merk could easily achieve the same key compression as an SMT, we're not bound to representing entries the same way as we do on disk: https://github.com/nomic-io/merk/issues/36
@musalbas, we would need to do a simulation. I think it's hard to estimate the SMT part. The expected path length is log(n), worst case 2log(n) (n = number of keys), rather than 256 = log(N) (N = domain size) -- but that should happen only to a few keys (if any).
Re: storing nodes by key: this will depend on the key convention and the average key size. Storing full keys costs about avg_key_size * n, versus about n/c * log(n) for some constant c > 1; so the question is whether avg_key_size <? log(n)/c. With c = 2.5 and n = 2^25 we have avg_key_size <? 10; with c = 2 and n = 2^30 we have avg_key_size <? 15. I have no idea what the c could be, except that it decreases when n increases.
An interesting paper has recently been published on storage commitments, which could be an alternative to SMT: Using Homomorphic hashes in coded blockchains
I see this issue's description already links to libra-jellyfish-merkle, but see also Compact Sparse Merkle Trees, which are a very similar idea:
I guess github references from discussions → issues don't work just yet, but there is some discussion of this wrt parallelism here: https://github.com/cosmos/cosmos-sdk/discussions/8134#discussioncomment-229639
Also here's an ethresearch thread on the compact SMT paper tony posted above https://ethresear.ch/t/compact-sparse-merkle-trees/3741
I think under the proposed separation of concerns (state storage as separate from storage commitments), it would be highly beneficial to consider storage solutions other than simple key/value stores - in particular, classical SQL database systems.
For my own work I've been coming to the conclusion that it would be REALLY nice to have the state stored in an SQL database, which would allow for the use of multiple indexes and join queries in transaction processing. My understanding is that the mature SQL databases are extremely well supported, highly efficient, with built in support for caching, indexes, ACID compliance, complex queries, etc.
They may not be as fast for raw lookups as a simple key/value store, but I would bet any production-grade SQL DB would outperform IAVL/goleveldb (or any other Merkle tree on top of a KV store) for most modules - especially any that make use of iterators, since AFAIK each step of the iterator requires (at minimum) another lookup (log(n)) in the nodedb, whereas getting a series of results from an SQL table is done in a single query and is almost certainly O(log(n)) if the iterated key has an index.
An SQL DB would also trivialize the whole "MultiStore" process, by just using multiple tables, rather than a single tree with key prefixes. Table names could easily be prefixed by module name to create the same sort of module based isolation within a shared store.
The problem of storage commitments/proofs could then be dealt with separately, with the committed data being something like "tablename : serialization_of_row" (or the hash of same).
The SQL DB could also be used as the backend for an SMT / Merkle-based commitment scheme, though long term it would probably be most efficient to use a cryptographic accumulator which isn't directly tree-based. I think this could also be made most efficient if updates to the store happen in the transaction processing phases, while changes to the storage commitment are queued up and processed in a batch in the commit phase.
I am currently working on a python based proof of concept for implementing an SMT (with batch inserts) on top of sqlite. If that experiment goes well, and there is at least some support for this path, I will likely try to build such a storage service. It would probably be in Haskell, as I'm hoping to use this along with kepler.
I also think it's worth planning ahead of time for cross language usage. ie whatever storage system / storage commitment scheme the Cosmos-SDK ends up using should be accessible through a separate service (such as gRPC) so that other language SDK implementations can reuse the same service, and so that replacement services could be constructed later which satisfy the same interface.
We might just be the first blockchain framework that uses RDBMS 😆 . How would merkle proofs work? Would those come from the storage commitment layer? And if so, how is that tied to the relational/logical storage layer? i.e. how do you prove what's in SQL is actually there and part of a structure?
How would merkle proofs work? Would those come from the storage commitment layer?
Yes, the idea would be for the service to keep the storage commitment layer in sync with what is stored in the database tables. I'm thinking of the storage commitment layer as an abstract cryptographic accumulator. ie something which acts like a Set, with insert/remove operations and a way to construct proofs of membership/non-membership which can be checked against either the accumulator itself, or some digest of it (root hash in the case of a merkle tree style accumulator).
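To make "acts like a Set" concrete, the interface I have in mind is roughly the following (all names are mine, not an existing API):

```go
package main

// Accumulator is the abstract commitment layer: it never needs to return the
// inserted data, only to prove (non-)membership against a small digest.
type Accumulator interface {
	Insert(element []byte) error
	Remove(element []byte) error
	Digest() []byte // e.g. a Merkle root for tree-based instantiations
	ProveMember(element []byte) (Proof, error)
	ProveNonMember(element []byte) (Proof, error)
}

// Proof can be checked by anyone holding only the digest.
type Proof interface {
	Verify(digest, element []byte) bool
}
```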
And if so, how is that tied to the relational/logical storage layer? i.e. how do you prove what's in SQL is actually there and part of a structure?
To keep them in sync, at the end of every block, the storage layer would need to send a set of additions/deletions to update the accumulator with. These would be hashes of the serializations of table rows (ie what a CSV file stores), prefixed with the name of the table they were from. So for example:
ADD:
Hash("bank_accounts : *address1* , *coins*")
Hash("bank_accounts : *address2* , 0")
DELETE:
Hash("bank_accounts : *address2*, *coins*")
This would be the result of a full transfer from address2 to address1.
So proving that some value is in the accumulator would be equivalent to a proof that the specified table has a row with that value.
It's basically like having two stores, one which is fast for inserting data and lookups (sql). And another one which doesn't have to be able to return the inserted data at all, but which is fast for creating/verifying proofs that a value was inserted (accumulator).
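A sketch of that end-of-block sync under the same assumptions (RowChange and the accumulator interface here are illustrative):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"strings"
)

// accumulator is the subset of the commitment layer this sync needs.
type accumulator interface {
	Insert(element []byte) error
	Remove(element []byte) error
}

// RowChange is one table row that was inserted or deleted during the block.
type RowChange struct {
	Table   string
	Columns []string // serialized column values, CSV-style
	Deleted bool
}

// rowHash commits to "tablename : col1 , col2 , ..." as described above.
func rowHash(c RowChange) [32]byte {
	return sha256.Sum256([]byte(fmt.Sprintf("%s : %s", c.Table, strings.Join(c.Columns, " , "))))
}

// commitBlock applies the block's accumulated row changes in one batch,
// keeping the accumulator in sync with the SQL tables.
func commitBlock(acc accumulator, changes []RowChange) error {
	for _, c := range changes {
		h := rowHash(c)
		op := acc.Insert
		if c.Deleted {
			op = acc.Remove
		}
		if err := op(h[:]); err != nil {
			return err
		}
	}
	return nil
}
```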
This would also obviate the need for #7099 and #7098, and simplify #7097.
We might just be the first blockchain framework that uses RDBMS
@alexanderbez Chain Core was using Postgres circa 2016, FWIW. Seemed to work pretty well there, and had some nice cloud deployment options (e.g. Amazon Aurora)
Who would like to take a first stab at an ADR? There are a lot of stakeholders, but I suggest one person own the ADR and solicit further design criteria and feedback from those who have contributed to this thread. If we are indeed going with SMT, perhaps someone from LazyLedger since they already have an initial implementation?
Yeah, we can surely take the first stab on an ADR and collect the input and feedback from the different stakeholders. @tzdybal started looking into replacing iavl with an SMT in the SDK and is interested in taking the lead on this :-)
A shared slack or discord channel with all parties involved could also be helpful. We'd like to understand the different use-cases and priorities better. Should we go ahead and create a shared slack/discord, too?
Let's start a channel in Discord.
Great! I can't start a discord channel myself but looking forward to it 👍🏼
BTW regarding that SQL / RDBMS discussion above:
We might just be the first blockchain framework that uses RDBMS
It might not be a "classical" blockchain, but Google's Trillian also comes with a MySQL-based storage backend (among others). As far as I remember it is not meant to be used in production, as it is rather slow, but it works fine for testing and development (and everything is authenticated, as with the other backends).
Ideally, the APIs / interfaces we settle on would even allow RDBMS backends (although I'm skeptical about whether we'd want to recommend that).
Also here's an ethresearch thread on the compact SMT paper tony posted above https://ethresear.ch/t/compact-sparse-merkle-trees/3741
@hxrts - when doing a report I went through that thread. IIRC, LL SMT is a realization of that idea.
@hxrts, @liamsi -- I spent a good amount of time analyzing this issue and looking for a good solution, so let me outline the thoughts that came from researching this subject. Happy to contribute. If you want, I can kick off and port the design described below into an ADR.
With @aaronc we were thinking about an SQL database with regards to: https://github.com/cosmos/cosmos-sdk/issues/7099
With Vulcanize, we started a proof of concept for streaming data from a Cosmos node to PostgreSQL. At Regen I'm doing another MVP, building a whole relational database (also using Postgres) for higher-level use.
For the use case linked above (https://github.com/cosmos/cosmos-sdk/issues/7099) we started with an approach where each message / data type will have a corresponding schema (https://github.com/cosmos/cosmos-sdk/issues/7097 and ADR-038) to properly assign a Merkle proof to off-chain data. I was thinking about an alternative design, influenced by the TurboGeth work I inspected a few months ago, with respect to the report I created for this task (https://github.com/cosmos/cosmos-sdk/issues/7100#issuecomment-685211612) and the general approach (https://github.com/cosmos/cosmos-sdk/issues/7100#issuecomment-676527835)
Today, only a module knows the ID (key) for every state object. ADR-038 tries to solve this. I am trying to find another way to solve it, by using canonical IDs - an ID constructed directly from a Message.
The idea is to use a QueryServer service name (e.g. "cosmos.bank.v1beta1.Msg") and combine it with a hash of the serialized object and maybe some other prefix, and then hash the whole thing. Example:
infix := "<container name>"
key := hash("cosmos.bank.v1beta1.<MsgName>.<infix>.<obj_hash>")
I'm not 100% sure if this works, but if not, we fall back to the Module Schema idea (which is driven by the current way we store objects with sub-stores).
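A minimal sketch of that construction in Go (the exact preimage layout is precisely what's still open here; treat every part of it as a placeholder):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// canonicalKey derives a state key directly from the message type, a
// container infix, and the object bytes, with no module-private key scheme.
func canonicalKey(svcName, msgName, infix string, obj []byte) []byte {
	objHash := sha256.Sum256(obj)
	// e.g. "cosmos.bank.v1beta1.MsgSend.<infix>.<obj_hash>"
	preimage := fmt.Sprintf("%s.%s.%s.%x", svcName, msgName, infix, objHash)
	key := sha256.Sum256([]byte(preimage))
	return key[:]
}
```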
If we stick with the "current" key construction, then I would strongly advocate to hash the key before storing it.
This part is more _obvious_:
Finally, we store a (SMTkey, SMTvalue) pair in an SMT (I'm skipping the details of how this operation works here, e.g. constructing internal nodes etc...).
Here we need a few things:
- an index: hash(object) -> object
- an index: SMTkey -> object (crucial to avoid traversing the SMT if we just want to query objects in the State Machine)

KV stores are an order of magnitude faster than RDBMS. TurboGeth is using KV-based indexing (so in the end they have 3 stores).
Based on all the existing blockchain implementations, I would suggest starting by porting SMT to Cosmos, updating the existing KV interface, and only later considering experimenting with RDBMS. For more sophisticated use-cases, as explained above, we will need a separate RDBMS anyway. We don't want to kill a node with analytics operations.
We might just be the first blockchain framework that uses RDBMS
Algorand is using SQLite as their main state storage (in fact, I'm not sure if they are using anything else). I outlined this in the overview of storage backends.
Okay, I can totally buy that a KV store would be an order of magnitude faster than RDBMS for looking up an object by hash. And for modules where that is the access pattern that fully makes sense.
For example, in my project I have Utxo-style transactions, where a transaction specifies a set of hashes of the existing Utxos it will spend, and a set of new Utxos to be created. Storing existing Utxos by hash in a KV store would certainly be faster than using an SQL table with a hash primary key. (I would be curious to know exactly how much faster it would be, say, comparing in-process SQLite with an in-process KV store.)
My desire to use something which supports full SQL queries was about modules/access patterns which require multiple indexes or joins. For example, an order-book-based DEX might have a storage schema like:
CREATE TABLE orders (creator address, asset AssetExpression, bid Boolean, limit_price Integer, creation Integer, expiry Integer)
Where determining the clearing price for a given asset at a given time requires looking up all orders whose creation/expiry window contains that time, and whose asset matches - so something like SELECT * FROM orders WHERE asset = *AssetExpression* AND creation <= time AND expiry >= time. This could then be joined to Utxo or Account data to check that balances are sufficient for these trades.
This can of course be replicated on a KV store by manually constructing extra indexes, and using iterators / repeated lookups to simulate joins. AFAIK this is what Cosmos-SDK is currently doing in modules like staking, and is the intention of #7098. This is fine, but it's a lot of extra engineering work to basically replicate what a RDBMS is designed to do.
P.S.: Also, even in the Utxo example, it would be good to have a secondary index on owner address, so that an account-like view could be constructed.
The most important thing is to _decouple state commitment from storage_. Once this is done, we can have different storage backends (or even 2 at the same time) and experiment further.
There was a need to use chat / discussion for getting into more details. Let's move here if we need to discuss anything, and keep this issue only for important notes related to decisions.
So, here is the Github discussion
For the datastore I'm partial to LMDB or BadgerDB, for all the reasons mentioned in the review doc: faster, feature-rich, and MVCC. I'm having a hard time distinguishing between those two; it looks like BadgerDB has the performance edge. It's interesting to me that BadgerDB appears to be more popular and battle-tested outside of the blockchain space, but it appears no popular blockchain has decided to use it (as their default, anyways)? I'm curious if turbo-geth compared those two. Turbo-geth also uses an ETL preprocessing approach to sort keys before writing to the DB in order to reduce write amplification; without this I am curious how much more performant LMDB actually is vs LevelDB.
I have a sort of knee-jerk reaction to using an SQL DB as the primary storage, due to performance reasons. An interface (e.g. tm-db) designed for operating on top of kvstores will "work" for SQL databases, but I don't think it will be performant enough for practical use. I can say from experience that substituting Postgres for LevelDB in go-ethereum does not work without a major access pattern overhaul, although SQLite should be significantly faster than Postgres for simple operations. I think the primary DB being embedded is also crucial; that would disqualify most SQL DBs, but not SQLite. SQL seems "less risky" as a secondary database, populated by means of ADR 038 etc.
For state commitments, I agree SMT looks like the best option! One question, why wouldn't the value in the SMT be the object itself instead of the hash of the object?
Still thinking about the Canonical IDs.
Note that several SQL databases expose some sort of high-performance key/value API which skips the SQL query engine.
For example, MySQL implements the memcached protocol, and this is also supported on Amazon RDS (although unfortunately not on GCP Cloud SQL). Postgres has hstore.
Let's continue the discussion in https://github.com/cosmos/cosmos-sdk/discussions/8297 , I will respond there to @i-norden question.