This is linked to meta-issue #7096.
This issue discusses potential approaches to the ongoing maintenance or replacement of IAVL.
It is our understanding that IAVL has effectively become an orphaned project within the Cosmos ecosystem. I don't have a full picture of the issues related to IAVL, and welcome those with more context to share it. /cc @ethanfrey @erikgrinaker @alexanderbez
A complete overhaul has been proposed as one solution. I don't fully have the context for why this is needed, and would appreciate those who do sharing the details here.
Replacing IAVL with a more mature alternative has been proposed as well. I'll just list out the alternatives that have been mentioned here:
I think there are two separate issues:
Efficient storage -- this can be done with RocksDB or something similar. Libra, for example, is using RocksDB for their storage module - in fact it's a wrapper on top of RocksDB that deals with the serialization of keys and values (we need to store multiple types of data, and key-value pairs in RocksDB are byte arrays). This wrapper enforces that all data in and out of the DB is structured according to predefined schemas (a sketch of such a wrapper follows the list below).
Storage proofs / storage commitments. To prove a state transition we need to be able to prove transaction inputs and outputs. This is usually done by committing to the whole state using a Merkle root.
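To make the wrapper idea in the first point concrete, here is a minimal sketch of a schema-enforcing layer over a raw byte-oriented store. All names here (KVStore, TypedStore, the JSON stand-in codec) are illustrative assumptions, not Libra's actual API:

```go
package store

import (
	"encoding/json"
	"fmt"
)

// KVStore is the raw byte-oriented interface a backend like RocksDB exposes.
type KVStore interface {
	Get(key []byte) ([]byte, error)
	Set(key, value []byte) error
}

// TypedStore enforces that everything written under one prefix round-trips
// through a single, predefined serialization schema for type T.
type TypedStore[T any] struct {
	db     KVStore
	prefix []byte // namespace for this data type, e.g. []byte("acc/")
}

func (s TypedStore[T]) key(k []byte) []byte {
	out := make([]byte, 0, len(s.prefix)+len(k))
	return append(append(out, s.prefix...), k...)
}

// Put serializes v according to the schema before it ever reaches the DB.
func (s TypedStore[T]) Put(k []byte, v T) error {
	bz, err := json.Marshal(v) // stand-in codec; Libra uses its own serialization
	if err != nil {
		return fmt.Errorf("schema violation: %w", err)
	}
	return s.db.Set(s.key(k), bz)
}

// Get deserializes back into the schema's type, so callers never see raw bytes.
func (s TypedStore[T]) Get(k []byte) (T, error) {
	var v T
	bz, err := s.db.Get(s.key(k))
	if err != nil {
		return v, err
	}
	return v, json.Unmarshal(bz, &v)
}
```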
Depending on the use-case it would be good to have a closer look at alternatives to Merkle trees for storage commitments. I will review the following papers and write a summary next week.
Pointproofs: Aggregating Proofs for Multiple Vector Commitments. It provides non-interactive aggregation of proofs across multiple commitments. This construction enables any third party to aggregate a collection of proofs with respect to different, independently computed commitments into a single proof represented by a 48-byte elliptic curve point.
I believe this could be very useful for light clients and all sorts of batch processes.
Merkle Mountain Range -- Merkle tree optimized for lists of sequentially appended data and its proofs.
I think we're explicitly avoiding RocksDB for various reasons. I do agree we should explore alternatives in logical storage. MMR, sparse Merkle trees, and urkel trees all look like good contenders. One thing to keep in mind is that we need:
IAVL is very problematic in its I/O patterns. After Amino, it's probably the largest Cosmos SDK performance bottleneck. It also uses a lot of indirection, which makes lookups very inefficient.
@alexanderbez thanks for the comment. Is there any resource summarizing your decision about RocksDB?
Also, what's the use case for doing prefix iteration?
Prefix iteration is used extensively throughout the SDK. It's how we do index scans.
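For readers unfamiliar with the pattern, here is a rough sketch of a prefix scan over an ordered KV store. The interface shapes mirror what the SDK's stores offer, but the names here are illustrative, not the actual SDK API:

```go
package main

type Iterator interface {
	Valid() bool
	Next()
	Key() []byte
	Value() []byte
	Close() error
}

type OrderedKV interface {
	// Iterator iterates over the half-open range [start, end)
	// in lexicographic (byte-wise) key order.
	Iterator(start, end []byte) Iterator
}

// prefixEnd returns the smallest key strictly greater than every key that
// starts with prefix (increment the last non-0xff byte and truncate).
func prefixEnd(prefix []byte) []byte {
	end := append([]byte(nil), prefix...)
	for i := len(end) - 1; i >= 0; i-- {
		if end[i] != 0xff {
			end[i]++
			return end[:i+1]
		}
	}
	return nil // all 0xff: scan to the end of the keyspace
}

// scanPrefix visits every entry whose key starts with prefix, e.g. every
// record stored under []byte("validators/") -- this is an index scan.
func scanPrefix(db OrderedKV, prefix []byte, visit func(k, v []byte)) {
	it := db.Iterator(prefix, prefixEnd(prefix))
	defer it.Close()
	for ; it.Valid(); it.Next() {
		visit(it.Key(), it.Value())
	}
}
```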
@marbar3778 mentioned the other day that our aim to be database-agnostic is also placing an additional burden on state tree update optimisation. We may want to consider limiting database support in conjunction with this issue, to leverage database-native efficiencies.
cc @musalbas if you have thoughts on use of jellyfish here
I think we're explicitly avoiding RocksDB for various reasons.
rocksdb is supported. I would not say we are avoiding it because we have done next to zero database configuration testing. It is still unclear if goleveldb is better for the sdk than others (badgerdb, boltdb, rocksdb). There are a couple users who do use rocksdb.
The Cosmos-SDK, IAVL, and Tendermint have support for 5 databases. Since the Cosmos-SDK and hopefully Tendermint are considered frameworks for building blockchains, an agnostic approach to supported databases was taken. This is nice, but causes some headaches in optimizations. We can provide sane defaults for various DBs, but this requires more testing, of which we currently have none.
I do agree with:
I think there at two separate issues:
Efficient storage should be a separate issue and have its own discussion. I opened an issue (https://github.com/cosmos/cosmos-sdk/issues/6406) to discuss supporting fewer dbs based on testing that needs to be done. This can be a start.
Ohh, I never said it wasn't supported - just that, from my experience, there are much better alternatives (e.g. LevelDB, Badger, etc.).
One thing to keep in mind as well is that there are libraries out there that are very well supported, tested, and benchmarked. Continuing to maintain IAVL, especially when it's sub-optimal at best, is probably not the best strategy.
It would be great if someone would take the lead on exploring alternatives.
Generally, it seems that the blockchain space is heading towards Sparse Merkle Trees (SMTs) for state trees - including Ethereum 2.0 (although it now seems they're leaning towards polynomial commitments), Libra (jellyfish), LazyLedger, and also Google's Trillian. However, all of these projects use slightly different designs for the tree, so these implementations are not necessarily interchangeable. Libra and LazyLedger use a compute-optimised SMT that replaces subtrees containing only default values with a single node; Ethereum 2.0 plans to use a compute-optimised tree by using a more efficient hash function with a standard SMT; and Trillian uses a standard unoptimised SMT (afaik).
At LazyLedger we have settled on using the Libra SMT, and implemented it in Golang. Libra has an implementation in Rust called jellyfish, but it is slightly different as it operates on "nibbles" (where each key is split into 4-bit sections). In LazyLedger's SMT implementation the storage code is abstracted away as a "MapStore" interface, so devs can define their own storage backend.
I'm not sure about the implementation complexity of IAVL trees, but I think SMTs are relatively simple and shouldn't be difficult to maintain. The core of the LazyLedger SMT implementation is 400 lines of code. We do store each node in the tree in the DB, though, so I'm not sure if that addresses the I/O performance concerns.
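For reference, a storage backend then only has to satisfy a tiny interface along these lines - Get/Set/Delete on raw bytes (this matches my reading of the lazyledger/smt MapStore; check the repo for the exact signatures). A minimal in-memory version:

```go
package main

import "fmt"

// MapStore is the shape of the storage interface the SMT abstracts over.
type MapStore interface {
	Get(key []byte) ([]byte, error)
	Set(key []byte, value []byte) error
	Delete(key []byte) error
}

// InMemoryStore is the simplest possible backend: a Go map.
type InMemoryStore struct {
	m map[string][]byte
}

func NewInMemoryStore() *InMemoryStore {
	return &InMemoryStore{m: make(map[string][]byte)}
}

func (s *InMemoryStore) Get(key []byte) ([]byte, error) {
	if v, ok := s.m[string(key)]; ok {
		return v, nil
	}
	return nil, fmt.Errorf("key %x not found", key)
}

func (s *InMemoryStore) Set(key []byte, value []byte) error {
	s.m[string(key)] = value
	return nil
}

func (s *InMemoryStore) Delete(key []byte) error {
	delete(s.m, string(key))
	return nil
}
```

Swapping in a LevelDB- or RocksDB-backed implementation changes nothing in the tree code itself.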
I think Ethereum 2.0's big bet is on polynomial commitments.
@musalbas - why did you settle on jellyfish?
We're not using jellyfish, we're using our own SMT library that adopts the same SMT design as specified in the Libra whitepaper. We chose this as it's the simplest tree design we're aware of with an O(log(k)) get/update cost. Vitalik's scheme using an optimised hash function is simpler, but there were some concerns around its security, and it still requires a hash operation every 16 levels in the tree.
Does Jellyfish or LL's compute-optimized SMT allow for prefix iteration/range queries? If so, that seems like a solid bet.
No I don't think so, because the keys are hashed, and so they should appear in random places in the tree. A range query would thus simply consist of multiple Merkle proofs.
Prefix iteration can be done separately. It doesn't need to be done on the storage commitment unless you need proofs / commitments while doing the iteration. Range queries can be implemented with prefix iteration.
Where are the range proofs used? What's the use-case for them? Maybe this can be solved with aggregatable commitments (like the aforementioned pointproofs or RSA accumulators).
@AdityaSripal do we have/need range proofs in the SDK? I do know we need to absolutely support range/prefix queries. However, I don't know what you mean by "A range query would thus simply consist of multiple Merkle proofs."
@musalbas? Can we avoid hashing the keys before insertion?
@alexanderbez, @musalbas - breaking a range query into single queries for Merkle proofs (assuming this is needed) sounds very inefficient.
@AdityaSripal do we have/need range proofs in the SDK? I do know we need to absolutely support range/prefix queries. However, I don't know what you mean by
A range query would thus simply consist of multiple Merkle proofs.
@musalbas? Can we avoid hashing the keys before insertion?
Hashing keys is necessary for a compute-optimised SMT, because if an attacker can choose which part of the tree a new key can be added to, they can cause Merkle proofs for certain keys to be very long (by inserting a key directly next to another key). Hashing prevents this by randomising the location that keys are in the tree.
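Concretely, the randomization step is just one hash before every write; a sketch, where `tree` stands for any SMT-like structure with a Set method (illustrative, not a specific library):

```go
package main

import "crypto/sha256"

type smt interface {
	Set(path, value []byte)
}

// setHashed inserts under the hash of the key, so the leaf's position in the
// 256-bit keyspace is uniform and an attacker can't place keys next to each
// other to lengthen someone else's Merkle proofs.
func setHashed(tree smt, key, value []byte) {
	path := sha256.Sum256(key)
	tree.Set(path[:], value)
}
```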
If you want to prove membership of multiple keys in the tree, that would require an individual Merkle proof per key, regardless if they are within the same "range".
It would be helpful to know where range proofs are being used for the state tree though. I understand why range proofs may be helpful for the Merkle tree of transactions, but not sure why they would be needed for the state tree.
Hashing keys is necessary for a compute-optimised SMT, because if an attacker can choose which part of the tree a new key can be added to, they can cause Merkle proofs for certain keys to be very long (by inserting a key directly next to another key). Hashing prevents this by randomising the location that keys are in the tree.
I see. I guess this isn't a problem with IAVL because it's self-balancing...
Since SMTs hash keys prior to insertion, like Merkle-Patricia trees do, I don't see how we'd have prefix iteration unless we can figure out how to leverage the underlying physical storage somehow :(
@AdityaSripal do we have/need range proofs in the SDK? I do know we need to absolutely support range/prefix queries.
At present, we do not support range proofs (of more than 1 key) in the SDK, since the subspace queries do not support returning proofs.
https://github.com/cosmos/cosmos-sdk/blob/master/store/iavl/store.go#L271
I brought it up in an issue a while ago here: https://github.com/cosmos/cosmos-sdk/issues/5241
And I tried to do some work on it, but I couldn't get it to work and there were other, more pressing blockers for IBC proofs, so I dropped it and worked on those.
Hashing keys is necessary for a compute-optimised SMT, because if an attacker can choose which part of the tree a new key can be added to, they can cause Merkle proofs for certain keys to be very long (by inserting a key directly next to another key). Hashing prevents this by randomising the location that keys are in the tree.
I see. I guess this isn't a problem with IAVL because it's self-balancing...
Since SMTs hash keys prior to insertion, like Merkle-Patricia trees do, I don't see how we'd have prefix iteration unless we can figure out how to leverage the underlying physical storage somehow :(
I believe if you use the optimised hash function optimisation for SMTs instead of the Libra optimisation, you can have free choice over where keys are stored in the tree, and can thus support range proofs. cc @adlerjohn
@erikgrinaker is on vacation, but when he's back it would be good to understand more about the current status of IAVL and what's needed to ensure the correctness of that implementation.
Just adding another option here:
https://github.com/nomic-io/merk by @mappum
It uses RocksDB, supports checkpointing, is still an AVL tree so supports range queries, and has the pieces to support state syncing.
@erikgrinaker is on vacation, but when he's back it would be good to understand more about the current status of IAVL and what's needed to ensure the correctness of that implementation.
As far as I know, IAVL is correct - it's just slow and inefficient, and not concurrency-safe. I think this is in part because it tries to do too much itself (e.g. versioning) which should ideally be offloaded to the underlying database engine. The access patterns are also suboptimal, requiring lots of random accesses - I'm not familiar with the state of the art in Merkle trees, but there's a reason databases use e.g. B+trees instead of binary trees.
Building a performant data store is non-trivial, and not something we should spend engineering resources on if we can avoid it in my opinion. I'd be happy to help explore alternatives if necessary.
@erikgrinaker the advantage of B+trees is that, in general, you get faster lookups (they are more compact). This is especially important for storage layers that are not very fast at random access, or where you can't efficiently use hardware caches.
Exactly. I suspect an AVL tree is inherently inefficient given block devices and CPU caches, but I don't know how Merkelization affects these tradeoffs.
In Merk, the Merkle tree structure doesn't affect the lookups (every node is indexed by its key rather than its hash), so there is practically no read overhead compared to plain RocksDB.
Merk looks very promising! Do you know how it compares to IAVL? I'm sure it's much better, but I just want to get a feel for raw numbers. Also, we'd probably need to port this to Go if there isn't already a version out there.
Do you know how it compares to IAVL?
15x - 40x higher write throughput last time I compared on my MacBook (the lower end with none of the tree kept in memory, the higher end with all of it kept in memory).
Also, we'd need to probably port this to Go if there isn't already a version out there.
Someone did port it to Go: https://github.com/tak1827/merk-go, although I'm not sure what the status of that is, plus it might be easier to just make bindings for the Rust lib rather than keep the 2 in sync.
Yeah, I'd lean towards bindings. I'm also fantasizing about using wasm for cross-language libraries, but don't know if that's viable or performant yet.
It seems that the library is not very big.
Also, I'm preparing a general overview and a spec doc to put together all requirements and find the best solution (maybe we can do better than a Merkle tree).
I'm still missing a piece about range proofs - where and why do we need them?
I see a need for bulk proofs / aggregated commitments -- for light client use cases and stateless clients.
Aren't range proofs a prerequisite for absence proofs, since these are really just range proofs over the neighbors of the absent key?
Aren't range proofs a prerequisite for absence proofs, since these are really just range proofs over the neighbors of the absent key?
Depending on the definition and implementation. For me, a range proof is more an aggregated commitment, or a proof over a range of values.
An absence proof is a proof that a specific value is not part of the state. With Merkle trees, an absence proof is implemented by providing proofs of two neighbors.
I'm still missing a piece about range proofs - where and why do we need it?
It makes proofs much more efficient (in terms of proof size, and CPU/IO for the node resolving the proof) when using data structures more advanced than just simple account maps. For example, you can split a large struct into multiple keys, and still get a proof of the fields you need without needing N completely separate Merkle branches.
Although that's just one optimization, the important part is that if you've implemented the tree to support range proofs you can also likely iterate by key in the on-chain logic (very significant for on-chain orderbooks, for example).
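As a sketch of the struct-splitting idea (the key layout here is hypothetical): give every field of an object a key under a shared prefix, so all of them sit in one contiguous key range and a single range proof can cover any subset of the fields:

```go
package main

// fieldKey builds "acct/<addr>/<field>". All fields of one account share the
// "acct/<addr>/" prefix, so they are adjacent in the keyspace and one range
// proof covers e.g. balance + sequence without two separate Merkle branches.
func fieldKey(addr []byte, field string) []byte {
	k := make([]byte, 0, len("acct/")+len(addr)+1+len(field))
	k = append(k, "acct/"...)
	k = append(k, addr...)
	k = append(k, '/')
	k = append(k, field...)
	return k
}
```

This of course relies on the tree preserving key order (i.e. not hashing keys), which is exactly the trade-off discussed below.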
Not necessarily - it depends on the tree design. In SMTs, non-membership proofs don't require range proofs, just a proof to show that the longest prefix of the key leads to a leaf with an empty value.
It makes proofs much more efficient (in terms of proof size, and CPU/IO for the node resolving the proof) when using data structures more advanced than just simple account maps. For example, you can split a large struct into multiple keys, and still get a proof of the fields you need without needing N completely separate Merkle branches.
I've got that. But I don't see the use case where this will apply. Since the accounts are randomly distributed, practically speaking no set of proofs will represent a range. Instead, as I wrote above, I see the big use case being aggregated proofs.
Not sure why there needs to be an explicit distinction. In Merk any selection of keys can be included in a query, so it can be any combination of ranges or individual keys. https://github.com/nomic-io/merk/blob/develop/docs/algorithms.md#structure
But a concrete example for ranges we are working with is an order book, where we need to access the best orders (sorted by price). We store these orders with keys being the encoding of (price, address). Getting a proof of e.g. the 100 highest bids means getting a single range which will only have around log(n) hashes, whereas an aggregated proof of 100 completely random keys would be around 100 * log(n) hashes.
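A sketch of that key encoding (field widths illustrative): the price goes first, big-endian and fixed-width, so byte-wise key order equals numeric price order and one price band is a single contiguous range:

```go
package main

import "encoding/binary"

// orderKey encodes (price, address). Big-endian fixed-width price makes
// lexicographic comparison agree with numeric comparison; the address suffix
// just breaks ties between orders at the same price.
func orderKey(price uint64, addr []byte) []byte {
	key := make([]byte, 8, 8+len(addr))
	binary.BigEndian.PutUint64(key, price)
	return append(key, addr...)
}
```

Fetching the 100 highest bids is then a single reverse scan from the top of the range.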
iterate by key in the on-chain logic (very significant for on-chain orderbooks, for example).
Many of the Cosmos SDK core modules make extensive use of this feature (basically anywhere you see an iterator, such as in governance or staking).
And here's an example of an orderbook like the one @mappum suggests (https://github.com/sunnya97/sdk-dex-mvp/blob/master/x/orderbook/orderwalls.go).
@blinky3713 might want to follow along here since FOAM had some specific needs from IAVL
You can generically add key iterability to the leaf data structure, independently of the tree format for handling the relationship between leaves.
In all the usecases of key iterability I know of in the SDK, we know at app-writing time that a particular leaf wants key iterability.
Thanks @hxrts. I don't really care what happens as long as someone maintains or ports the grpc interface with protobuf files that I worked really hard to revive and get merged. If that goes away I'm kind of at a loss for words as to how disappointed I will be, after (1) writing this argument up and having multiple people from tendermint/cosmos sign off on it (at least implicitly) and (2) implementing it myself in this PR after multiple rounds of feedback, considering that I almost never write go and it was an ordeal for me.
If you are reading this and are unfamiliar with kepler, the Haskell-language version of the cosmos sdk, see here.
I was checking the SDK, and I couldn't find any usage of range proofs. The Order book example uses a storage iterator, not a range proof. And the storage iterator can be implemented on the storage layer, not on the state commitments layer.
We don't construct computation proofs, so the state transition is only validated through block acceptance: re-computation of all transactions in the block and a state commitment check.
Range proofs, or partial / selective commitments, could be very useful for stateless clients though.
I've finalized a report reviewing potential solutions for storage and a new commitment schema (today I added conclusions). I've put it on Dropbox Paper - it's much easier to read and comment there than on a GitHub PR (Markdown in diff format is not very friendly). Later I can move it to GitHub if needed.
Main goals:
THE REPORT has 3 sections: the state commitments review, storage review and conclusions (personal recommendation).
Would love to hear your feedback.
@mappum - I went through the Merk description (didn't go into the code). It looks interesting, however a few things are not clear to me. I added them as TODOs in the review I linked above. It would be great to finalize them:
- Could you assess whether Merk is using less storage than SMT? To me it's the opposite: an optimized SMT stores only the required nodes and can compress keys (we don't store the same prefixes twice), whereas Merk is storing whole keys all the time.
I would assume the key overhead means Merk is using more storage, but IMO for the systems we're building, storage is low on the resource priority list. We can still achieve similar key compression for encoding proofs so there is not a higher bandwidth cost for queries/state sync.
- Is there any benchmark comparing Merk and an SMT-based commitment?
Not that I know of, what operations are you interested in comparing?
For the tree storage - would be great to keep it small so it can easily fit in RAM.
Re benchmarks: most of the state commitment operations are related to adding and updating state elements.
I think Merk will likely be faster (at least for read operations, not sure about updates) because it makes implementation optimisations that various SMT implementations don't. However, I think the optimisations that Merk makes are behind-the-scenes implementation optimisations that could be made for other trees like SMTs too, without changing the underlying data structure - for example, storing nodes by key rather than by hash, to reduce the number of database accesses required to get a key from up to 256 to 1. This doesn't seem to require changing the actual data structure of the tree, just how the tree is stored locally.
For the tree storage - would be great to keep it small so it can easily fit in RAM.
For the in-memory tree representation, Merk could easily achieve the same key compression as an SMT, we're not bound to representing entries the same way as we do on disk: https://github.com/nomic-io/merk/issues/36
@musalbas, we would need to do a simulation. I think it's hard to estimate the SMT part. The expected path length is log(n), worst case 2log(n) (n = number of keys), rather than 256 = log(N) (N = domain size) -- but that should happen only to a few keys (if any).
Re: storing nodes by key: this will depend on the key convention and the average key size. Storing full keys costs about avg_key_size * n, versus about n/c * log(n) for some constant c > 1; so the question is whether avg_key_size <? log(n)/c. With c = 2.5 and n = 2^25 we have avg_key_size <? 10; with c = 2 and n = 2^30 we have avg_key_size <? 15. I have no idea what the c could be, except that it decreases when n increases.
An interesting paper has recently been published on storage commitments, which could be an alternative to SMT: Using Homomorphic hashes in coded blockchains
I see this issue's description already links to libra-jellyfish-merkle, but see also Compact Sparse Merkle Trees, which are a very similar idea:
I guess github references from discussions → issues don't work just yet, but there is some discussion of this wrt parallelism here: https://github.com/cosmos/cosmos-sdk/discussions/8134#discussioncomment-229639
Also here's an ethresearch thread on the compact SMT paper tony posted above https://ethresear.ch/t/compact-sparse-merkle-trees/3741
I think under the proposed separation of concerns (state storage as separate from storage commitments), it would be highly beneficial to consider storage solutions other than simple key/value stores - in particular, classical SQL database systems.
For my own work I've been coming to the conclusion that it would be REALLY nice to have the state stored in an SQL database, which would allow for the use of multiple indexes and join queries in transaction processing. My understanding is that the mature SQL databases are extremely well supported, highly efficient, with built in support for caching, indexes, ACID compliance, complex queries, etc.
They may not be as fast for raw lookups as a simple key/value store, but I would bet any production-grade SQL DB would outperform IAVL/goleveldb (or any other Merkle tree on top of a KV store) for most modules - especially any that make use of iterators, since AFAIK each step of the iterator requires (at minimum) another lookup (log(n)) in the nodedb, whereas getting a series of results from an SQL table is done in a single query and is almost certainly O(log(n)) if the iterated key has an index.
An SQL DB would also trivialize the whole "MultiStore" process, by just using multiple tables, rather than a single tree with key prefixes. Table names could easily be prefixed by module name to create the same sort of module based isolation within a shared store.
The problem of storage commitments/proofs could then be dealt with separately, with the committed data being something like "tablename : serialization_of_row" (or the hash of same).
The SQL DB could also be used as the backend for an SMT / Merkle-based commitment scheme, though long term it would probably be most efficient to use a cryptographic accumulator which isn't directly tree-based. I think this could also be made most efficient if updates to the store happen in the transaction processing phases, while changes to the storage commitment are queued up and processed in a batch in the commit phase.
I am currently working on a python based proof of concept for implementing an SMT (with batch inserts) on top of sqlite. If that experiment goes well, and there is at least some support for this path, I will likely try to build such a storage service. It would probably be in Haskell, as I'm hoping to use this along with kepler.
I also think it's worth planning ahead of time for cross language usage. ie whatever storage system / storage commitment scheme the Cosmos-SDK ends up using should be accessible through a separate service (such as gRPC) so that other language SDK implementations can reuse the same service, and so that replacement services could be constructed later which satisfy the same interface.
We might just be the first blockchain framework that uses RDBMS 😆 . How would merkle proofs work? Would those come from the storage commitment layer? And if so, how is that tied to the relational/logical storage layer? i.e. how do you prove what's in SQL is actually there and part of a structure?
How would merkle proofs work? Would those come from the storage commitment layer?
Yes, the idea would be for the service to keep the storage commitment layer in sync with what is stored in the database tables. I'm thinking of the storage commitment layer as an abstract cryptographic accumulator. ie something which acts like a Set, with insert/remove operations and a way to construct proofs of membership/non-membership which can be checked against either the accumulator itself, or some digest of it (root hash in the case of a merkle tree style accumulator).
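To make "acts like a Set" concrete, the interface I have in mind is roughly the following (all names are mine, not an existing API):

```go
package main

// Accumulator is the abstract commitment layer: it never needs to return the
// inserted data, only to prove (non-)membership against a small digest.
type Accumulator interface {
	Insert(element []byte) error
	Remove(element []byte) error
	Digest() []byte // e.g. a Merkle root for tree-based instantiations
	ProveMember(element []byte) (Proof, error)
	ProveNonMember(element []byte) (Proof, error)
}

// Proof can be checked by anyone holding only the digest.
type Proof interface {
	Verify(digest, element []byte) bool
}
```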
And if so, how is that tied to the relational/logical storage layer? i.e. how do you prove what's in SQL is actually there and part of a structure?
To keep them in sync, at the end of every block, the storage layer would need to send a set of additions/deletions to update the accumulator with. These would be hashes of the serializations of table rows (ie what a CSV file stores), prefixed with the name of the table they were from. So for example:
ADD:
Hash("bank_accounts : *address1* , *coins*")
Hash("bank_accounts : *address2* , 0")
DELETE:
Hash("bank_accounts : *address2*, *coins*")
This would be the result of a full transfer from address2 to address1.
So proving that some value is in the accumulator would be equivalent to a proof that the specified table has a row with that value.
It's basically like having two stores, one which is fast for inserting data and lookups (sql). And another one which doesn't have to be able to return the inserted data at all, but which is fast for creating/verifying proofs that a value was inserted (accumulator).
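A sketch of that end-of-block sync under the same assumptions (RowChange and the accumulator interface here are illustrative):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"strings"
)

// accumulator is the subset of the commitment layer this sync needs.
type accumulator interface {
	Insert(element []byte) error
	Remove(element []byte) error
}

// RowChange is one table row that was inserted or deleted during the block.
type RowChange struct {
	Table   string
	Columns []string // serialized column values, CSV-style
	Deleted bool
}

// rowHash commits to "tablename : col1 , col2 , ..." as described above.
func rowHash(c RowChange) [32]byte {
	return sha256.Sum256([]byte(fmt.Sprintf("%s : %s", c.Table, strings.Join(c.Columns, " , "))))
}

// commitBlock applies the block's accumulated row changes in one batch,
// keeping the accumulator in sync with the SQL tables.
func commitBlock(acc accumulator, changes []RowChange) error {
	for _, c := range changes {
		h := rowHash(c)
		op := acc.Insert
		if c.Deleted {
			op = acc.Remove
		}
		if err := op(h[:]); err != nil {
			return err
		}
	}
	return nil
}
```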
This would also obviate the need for #7099 and #7098, and simplify #7097.
We might just be the first blockchain framework that uses RDBMS
@alexanderbez Chain Core was using Postgres circa 2016, FWIW. Seemed to work pretty well there, and had some nice cloud deployment options (e.g. Amazon Aurora)
Who would like to take a first stab at an ADR? There are a lot of stakeholders, but I suggest one person own the ADR and solicit further design criteria and feedback from those who have contributed to this thread. If we are indeed going with SMT, perhaps someone from LazyLedger since they already have an initial implementation?
Yeah, we can surely take the first stab on an ADR and collect the input and feedback from the different stakeholders. @tzdybal started looking into replacing iavl with an SMT in the SDK and is interested in taking the lead on this :-)
A shared slack or discord channel with all parties involved could also be helpful. We'd like to understand the different use-cases and priorities better. Should we go ahead and create a shared slack/discord, too?
Let's start a channel in Discord.
Great! I can't start a discord channel myself but looking forward to it 👍🏼
BTW regarding that SQL / RDBMS discussion above:
We might just be the first blockchain framework that uses RDBMS
It might not be a "classical" blockchain, but Google's Trillian also comes with a MySQL-based storage backend (among others). As far as I remember it is not meant to be used in production, as it is rather slow, but it works fine for testing and development (and everything is authenticated, as with the other backends).
Ideally, the APIs / interfaces we settle on would even allow RDBMS backends (although I'm skeptical about whether we'd want to recommend that).
Also here's an ethresearch thread on the compact SMT paper tony posted above https://ethresear.ch/t/compact-sparse-merkle-trees/3741
@hxrts - when doing a report I went through that thread. IIRC, LL SMT is a realization of that idea.
@hxrts, @liamsi -- I spent a good amount of time analyzing this issue and looking for a good solution, so let me outline the thoughts that came from researching this subject. Happy to contribute. If you want, I can kick off and port the design described below into an ADR.
With @aaronc we were thinking about an SQL database with regards to: https://github.com/cosmos/cosmos-sdk/issues/7099
With Vulcanize, we started a proof of concept for streaming data from a Cosmos node to PostgreSQL. At Regen I'm doing another MVP, building a whole relational database (also using Postgres) for higher-level use.
For the use case linked above (https://github.com/cosmos/cosmos-sdk/issues/7099) we started with an approach where each message / data type will have a corresponding schema (https://github.com/cosmos/cosmos-sdk/issues/7097 and ADR-038) to properly assign a Merkle proof to off-chain data. I was thinking about an alternative design, influenced by the TurboGeth work I inspected a few months ago, with respect to the report I created for this task (https://github.com/cosmos/cosmos-sdk/issues/7100#issuecomment-685211612) and the general approach (https://github.com/cosmos/cosmos-sdk/issues/7100#issuecomment-676527835)
Today, only a module knows the ID (key) for every state object. ADR-038 tries to solve this. I am trying to find another way to solve it, by using canonical IDs - an ID constructed directly from a Message.
The idea is to use a QueryServer service name (e.g. "cosmos.bank.v1beta1.Msg") and combine it with a hash of the serialized object and maybe some other prefix, and then hash the whole thing. Example:
infix := "<container name>"
key := hash("cosmos.bank.v1beta1.<MsgName>.<infix>.<obj_hash>")
I'm not 100% sure if this works, but if not, we fall back to the Module Schema idea (which is driven by the current way we store objects with sub-stores).
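A minimal sketch of that construction in Go (the exact preimage layout is precisely what's still open here; treat every part of it as a placeholder):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// canonicalKey derives a state key directly from the message type, a
// container infix, and the object bytes, with no module-private key scheme.
func canonicalKey(svcName, msgName, infix string, obj []byte) []byte {
	objHash := sha256.Sum256(obj)
	// e.g. "cosmos.bank.v1beta1.MsgSend.<infix>.<obj_hash>"
	preimage := fmt.Sprintf("%s.%s.%s.%x", svcName, msgName, infix, objHash)
	key := sha256.Sum256([]byte(preimage))
	return key[:]
}
```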
If we stick with the "current" key construction, then I would strongly advocate to hash the key before storing it.
This part is more _obvious_:
Finally, we store a (SMTkey, SMTvalue) pair in an SMT (I'm skipping the details of how this operation works here, e.g. constructing internal nodes etc...).
Here we need a few things:
- an index: hash(object) -> object
- an index: SMTkey -> object (crucial to avoid traversing the SMT if we just want to query objects in the State Machine)

KV stores are an order of magnitude faster than RDBMS. TurboGeth is using KV-based indexing (so in the end they have 3 stores).
Based on all the existing blockchain implementations, I would suggest starting by porting SMT to Cosmos, updating the existing KV interface, and only later considering experimenting with RDBMS. For more sophisticated use-cases, as explained above, we will need a separate RDBMS anyway. We don't want to kill a node with analytics operations.
We might just be the first blockchain framework that uses RDBMS
Algorand is using SQLite as their main state storage (in fact, I'm not sure if they are using anything else). I outlined this in the overview of storage backends.
Okay, I can totally buy that a KV store would be an order of magnitude faster than RDBMS for looking up an object by hash. And for modules where that is the access pattern that fully makes sense.
For example, in my project I have Utxo-style transactions, where a transaction specifies a set of hashes of the existing Utxos it will spend, and a set of new Utxos to be created. Storing existing Utxos by hash in a KV store would certainly be faster than using an SQL table with a hash primary key. (I would be curious to know exactly how much faster it would be, say, comparing in-process SQLite with an in-process KV store.)
My desire to use something which supports full SQL queries was about modules/access patterns which require multiple indexes or joins. For example, an order-book-based DEX might have a storage schema like:
CREATE TABLE orders (creator address, asset AssetExpression, bid Boolean, limit_price Integer, creation Integer, expiry Integer)
Where determining the clearing price for a given asset at a given time requires looking up all orders whose creation/expiry window contains that time, and whose asset matches - so something like SELECT * FROM orders WHERE asset = *AssetExpression* AND creation <= time AND expiry >= time. This could then be joined to Utxo or Account data to check that balances are sufficient for these trades.
This can of course be replicated on a KV store by manually constructing extra indexes, and using iterators / repeated lookups to simulate joins. AFAIK this is what Cosmos-SDK is currently doing in modules like staking, and is the intention of #7098. This is fine, but it's a lot of extra engineering work to basically replicate what a RDBMS is designed to do.
P.S.: Also, even in the Utxo example, it would be good to have a secondary index on owner address, so that an account-like view could be constructed.
The most important thing is to _decouple state commitment from storage_. Once this is done, we can have different storage backends (or even 2 at the same time) and experiment further.
There was a need to use chat / discussion for getting into more details. Let's move here if we need to discuss anything, and keep this issue only for important notes related to decisions.
So, here is the Github discussion
For the datastore I'm partial to LMDB or BadgerDB, for all the reasons mentioned in the review doc: faster, feature-rich, and MVCC. I'm having a hard time distinguishing between those two; it looks like BadgerDB has the performance edge. It's interesting to me that BadgerDB appears to be more popular and battle-tested outside of the blockchain space, but it appears no popular blockchain has decided to use it (as their default, anyways)? I'm curious if turbo-geth compared those two. Turbo-geth also uses an ETL preprocessing approach to sort keys before writing to the DB in order to reduce write amplification; without this I am curious how much more performant LMDB actually is vs LevelDB.
I have a sort of knee-jerk reaction to using an SQL DB as the primary storage, due to performance reasons. An interface (e.g. tm-db) designed for operating on top of kvstores will "work" for SQL databases, but I don't think it will be performant enough for practical use. I can say from experience that substituting Postgres for LevelDB in go-ethereum does not work without a major access pattern overhaul, although SQLite should be significantly faster than Postgres for simple operations. I think the primary DB being embedded is also crucial; that would disqualify most SQL DBs, but not SQLite. SQL seems "less risky" as a secondary database, populated by means of ADR 038 etc.
For state commitments, I agree SMT looks like the best option! One question, why wouldn't the value in the SMT be the object itself instead of the hash of the object?
Still thinking about the Canonical IDs.
Note that several SQL databases expose some sort of high-performance key/value API which skips the SQL query engine.
For example, MySQL implements the memcached protocol, and this is also supported on Amazon RDS (although unfortunately not on GCP Cloud SQL). Postgres has hstore.
Let's continue the discussion in https://github.com/cosmos/cosmos-sdk/discussions/8297 , I will respond there to @i-norden question.