Cockroach: roadmap: Blob storage

Created on 21 Dec 2014 · 17 comments · Source: cockroachdb/cockroach

First off, I think this is awesome. I did a series of blog posts about a year ago (http://blog.justinsb.com/blog/categories/cloudata/) that posited the existence of a reliable key-value store and a blob store, and built all sorts of cool things on top of it. For example, a git store would store only the refs in consistent storage, but would store the data itself in the blob store. Similarly a filesystem: metadata in the key-value store; file data in blob storage. It might be fun for me to update that blog series using cockroachdb.

By blob storage, I mean immutable storage, where blobs are stored by their SHA-256 hash or similar. Blobs can be created and are replicated for reliable storage, but then the only "modification" allowed is deletion.

Do you have any plans/pointers for integrating blob storage? I could always just rely on S3 or similar, but this seems like something you might well be planning on supporting directly. (I presume it's not a good idea to just store these (multi-MB) blobs as values in the k-v store.)
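The content-addressed model described above can be sketched in a few lines of Go. This is an illustrative toy, not any CockroachDB API; `BlobStore`, `Put`, `Get`, and `Delete` are hypothetical names:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// BlobStore is a toy content-addressed store: blobs are keyed by
// their SHA-256 hash, so a stored blob can never be modified in
// place, only deleted. Hypothetical sketch, not CockroachDB API.
type BlobStore struct {
	blobs map[string][]byte
}

func NewBlobStore() *BlobStore {
	return &BlobStore{blobs: make(map[string][]byte)}
}

// Put stores data and returns its content address.
func (s *BlobStore) Put(data []byte) string {
	sum := sha256.Sum256(data)
	key := hex.EncodeToString(sum[:])
	s.blobs[key] = data // idempotent: same content always maps to same key
	return key
}

// Get retrieves a blob by its content address.
func (s *BlobStore) Get(key string) ([]byte, error) {
	data, ok := s.blobs[key]
	if !ok {
		return nil, errors.New("blob not found")
	}
	return data, nil
}

// Delete is the only permitted "modification".
func (s *BlobStore) Delete(key string) {
	delete(s.blobs, key)
}

func main() {
	s := NewBlobStore()
	key := s.Put([]byte("hello blob"))
	data, _ := s.Get(key)
	fmt.Printf("%s -> %q\n", key[:8], data)
}
```

Because keys are derived from content, writing the same blob twice is a no-op, which is exactly what makes the git-refs and filesystem-metadata designs above workable.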

A-sql-hibernate C-enhancement C-wishlist O-community


All 17 comments

If you are willing to live with the restrictions for immutable blobs, that opens up a lot of design possibilities. I can imagine a couple different kinds of blob storage that might be useful in various situations. For example, some applications might work well in a peer-to-peer fashion, instead of using centralized servers.

I presume it's not a good idea to just store these (multi-MB) blobs as values in the k-v store.

Performance can potentially be much better with external blob storage. For a particular key-value pair, you are only talking to one node (the currently elected leader), which may not be located nearby; it could be in another datacenter. The leader may also be busy handling transactions from other clients.

Presumably, if you are replicating blobs, the client can choose the closest and/or least busy server to download from. It is also easier to scale up the serving of static content.

While at Google, I worked on Colossus, which was the successor to GFS. It used bigtable to scale out file metadata and a server at each disk to store chunks of data. Aside from increased scale, it used Reed Solomon encoding for more efficient storage. I'd of course like to do something similar on top of Cockroach at some point.

You definitely wouldn't want to store multi-MB blobs as values in the KV store. I believe that leveldb internally will store anything larger than 64K as a separate file.

@spencerkimball

It used bigtable to scale out file metadata and a server at each disk to store chunks of data.

So the metadata server is also distributed and each of them only put part of the metadata into memory?

Aside from increased scale, it used Reed Solomon encoding for more efficient storage.

Reed Solomon encoding/decoding is computationally expensive as far as I know. Is this client-driven, or is it done on the server side?

I'd of course like to do something similar on top of Cockroach at some point.

I would love to see this happen.

Hi,

I've finally just released https://github.com/photosrv/photosrv, which is aimed at handling immutable blob storage. It still needs work in some areas and doesn't do any kind of encoding/decoding, but it has already been proven to scale effectively to hundreds of millions of files.

Maybe there is room for an integration layer between photosrv and CockroachDB?

Cheers,

@xiang90

The metadata was distributed, but it didn't hold the data in memory, except through bigtable's normal caching facility.

Reed Solomon encoding/decoding can be made blazingly fast with appropriate optimizations, both in how the Cauchy encoding matrix is chosen and with carefully tuned low-level inner-loop instructions. At Google, this was done exclusively on the client side for encoding, and for on-the-fly decoding in the event that requested data blocks were not available, necessitating reads of parity blocks. It was also done server-side for permanent reconstruction when machines went missing, data corruption was identified, or disks died.

@spencerkimball
Hi, I have one question.
I see that Google's Spanner was built on top of Colossus, but Cockroach doesn't depend on any distributed FS. What's the difference between them?

Running on top of Colossus is a very efficient proposition since it uses erasure coded storage. It means that if you store replicas of part of your key range within five datacenters, you end up with 5 * 1.67x = 8.3x encoding rate. Back in the days of GFS, the same configuration would mean 5 * 3x = 15x encoding rate for triplicated storage or 5 * 2x = 10x encoding for just two replicas / datacenter (gfs r=2 encoding).

Cockroach doesn't use a separate distributed file system as a dependent layer, so with five datacenters, you'd either end up with a 5x encoding rate (one replica in each) or you can increase the Raft replication to require two replicas per datacenter (or even three, though I don't think that would make as much sense). Using only a single replica per datacenter means you need inter-datacenter bandwidth in order to recover a lost disk or machine. Using two replicas per datacenter means you could very often rely on intra-datacenter bandwidth for recovery. The cost of using two replicas in each datacenter is in encoding efficiency (2x) and in write latencies, which increase both from the additional bandwidth required on writes and probably from a change in the latency histogram due to more replicas participating in consensus. You could possibly get really clever with the inter-datacenter bandwidth by sending to only one replica per datacenter and having that replica responsible for forwarding to the alternate, but that would be a really onerous bit of complexity to add.
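The encoding-rate arithmetic in these paragraphs reduces to copies * (k + m) / k, where plain replication is the special case m = 0. A tiny helper to reproduce the numbers; the (6, 4) code parameters are an assumption chosen to match the 1.67x figure, since the comment doesn't say which parameters Colossus used:

```go
package main

import "fmt"

// encodingRate returns total bytes stored per logical byte when each
// of `copies` datacenter-level copies is itself stored with a (k, m)
// erasure code. Plain replication is the special case m = 0.
func encodingRate(copies, k, m int) float64 {
	return float64(copies) * float64(k+m) / float64(k)
}

func main() {
	// Five datacenters, erasure-coded at ~1.67x per copy: 5 * 1.67 = 8.3x.
	fmt.Printf("Colossus-style RS(6,4): %.1fx\n", encodingRate(5, 6, 4))
	// GFS-era triplication per datacenter: 5 * 3 = 15x.
	fmt.Printf("GFS r=3: %.0fx\n", encodingRate(5, 1, 2))
	// One Raft replica per datacenter: 5x.
	fmt.Printf("Raft, 1 replica/DC: %.0fx\n", encodingRate(5, 1, 0))
	// Two Raft replicas per datacenter: 10x.
	fmt.Printf("Raft, 2 replicas/DC: %.0fx\n", encodingRate(10, 1, 0))
}
```

The same helper also shows why doubling intra-datacenter replicas exactly doubles the storage cost while leaving the erasure-coding option strictly cheaper per copy.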

Eventually Cockroach could just as easily run using a distributed file system, but at this stage it would be a mistake to require such a complex external dependency. Most Cockroach clusters will start as single-datacenter deployments which makes the inter-datacenter bandwidth issue a moot point. Further, a very smart and cheap way to mitigate the costs of recovery in multi-datacenter clusters would be to use more reliable storage than Google uses internally. For example, hardware RAID.

I'm quite serious about the next step being CockroachFS, which would introduce erasure encoding and CockroachDB could bootstrap itself and run on top of that, making it closer to the Spanner architecture where storage is concerned.

I see, thank you.

We just ported our old FoundationFS demo to cockroachdb as an exercise to learn/understand cockroachdb. Overall, it was a good experience, and we intend to keep this project up to date as cockroachdb moves into beta and production. You can check it out at https://github.com/cloudmode/roachclip-fs. It's an attempt to replicate MongoDB's GridFS interface.

Just to clarify, roachclip-fs is for file storage, it's not a file system.

@spencerkimball

Reed Solomon encoding/decoding can be made blazingly fast with appropriate optimizations. Both in how the cauchy encoding matrix is chosen and with carefully tuned low-level inner-loop instructions.

I just found https://github.com/tsuraan/Jerasure and https://github.com/catid/longhair. I am not sure if it's worth a rewrite in Go.

At google, this was done exclusively on the client-side for encoding, and for decoding on-the-fly in the event that requested data blocks were not available, necessitating reads of parity blocks. It was also done server-side for permanent reconstructions when machines went missing, data corruption was identified, or disks died.

I did some research on this recently, and at the moment I am more interested in permanent reconstruction on the server side.

My understanding is that this has to be done on the server side, since we want to reduce bandwidth on the client side. Reconstructing a missing block might involve up to k transmissions for a (k, m) code.

But does this also mean that the server side has to know the original encoding matrix, and that reconstruction needs some help from the upper layer (or does it need to remember the matrix)?

@xiang90 At Google, on Colossus we did both client-side and server-side reconstruction. Waiting for the server when accessing via the client would result in unacceptable latency. There are also some common situations where an optimized data layout results in no additional bandwidth to reconstruct (e.g. when scanning a file a "stripe" at a time). The encoding matrix is just a binary string of length k * m. We'd store that with the file metadata so the server could reconstruct at its leisure.

Here's a description of a possible data layout:

[image: stripe data layout diagram]

The above diagram shows the data layout for two "stripes", a convenient concept for describing pieces of a larger file. In this example, there are 8 data chunks and 4 code chunks in the stripe. Another common format might be 6.4. The metadata for a larger file was broken down by stripes; each stripe would have 12 "chunks". In the diagram above, a chunk is represented by a vertical column containing 8MB of non-contiguous data. A stripe here contains 64MB of data and 32MB of parity/code blocks, and a "block" is one of the cells in the diagram.

Why lay out the data this way? There are some compelling reasons. First, you expect many files to be relatively small. If you have files flirting with sizes within an order of magnitude of the stripe size, you would suffer quite expensive overhead in code blocks for a file whose size modulo the stripe size is (say) 16MB. In that case, the final stripe would contain only 16MB of data blocks. If you laid out data contiguously along chunks, you would need all 32MB of code blocks, and your final stripe of data would have an encoding rate of 3x. Not good. With the layout in the diagram, you spread the 16MB of data blocks across only the first two block rows and use only 8MB of code blocks (just those along the first two "mini" stripes), giving an encoding rate of 1.5x.

For similar reasons (and critically for performance), on-the-fly recovery of data is greatly enhanced by this data layout. If you're trying to read 2MB of data and one of the chunks is unavailable, you'll likely end up reading 8MB of data to reconstruct the missing 1MB (the available 1MB of the 2MB you wanted, plus 7 additional blocks from the remaining 10 chunks). This is a 4x read blowup. On the other hand, with the contiguous block layout, missing 2MB of data would mean reading 16MB (2MB from each of 8 of the remaining chunks), for an 8x read blowup. Further, if you read in 8MB increments, a missing chunk with this data layout actually incurs no read blowup cost (say you read from 7 of the 8 data chunks; then you just need 1MB from one of the 4 code chunks). For the contiguous chunk model, you'd need to read 64MB. Ouch: an 8x read blowup.
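The final-stripe numbers above (1.5x vs. 3x for a 16MB tail) can be checked with a short calculation, using the 8-data/4-code, 1MB-block parameters from the comment. `finalStripeRate` is a hypothetical helper, not anything from Colossus:

```go
package main

import "fmt"

// Stripe parameters from the comment: 8 data chunks, 4 code chunks,
// 8MB chunks made of 1MB blocks.
const (
	dataChunks = 8
	codeChunks = 4
	blockMB    = 1
	chunkMB    = 8
)

// finalStripeRate returns (total MB written) / (data MB) for a final
// partial stripe holding dataMB of data, under the two layouts.
func finalStripeRate(dataMB int) (rowWise, contiguous float64) {
	// Row-wise layout: data fills whole 8MB rows (one 1MB block per
	// data chunk), and only the code blocks on those rows are needed.
	rows := (dataMB + dataChunks*blockMB - 1) / (dataChunks * blockMB)
	rowWise = float64(dataMB+rows*codeChunks*blockMB) / float64(dataMB)
	// Contiguous layout: any data at all requires every code chunk in full.
	contiguous = float64(dataMB+codeChunks*chunkMB) / float64(dataMB)
	return rowWise, contiguous
}

func main() {
	rw, ct := finalStripeRate(16) // the 16MB example from the comment
	fmt.Printf("row-wise: %.1fx, contiguous: %.1fx\n", rw, ct)
}
```

For the 16MB tail this yields 1.5x row-wise versus 3.0x contiguous, matching the figures in the comment.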

I recently found this Go library for Reed-Solomon encoding. It claims 1GB/s per CPU core.
https://github.com/klauspost/reedsolomon

Zendesk ticket #2743 has been linked to this issue.

OID / BLOB support would bring CockroachDB closer to a drop-in replacement for PostgreSQL.

While at Google, I worked on Colossus, which was the successor to GFS. It used bigtable to scale out file metadata and a server at each disk to store chunks of data. Aside from increased scale, it used Reed Solomon encoding for more efficient storage. I'd of course like to do something similar on top of Cockroach at some point.

You definitely wouldn't want to store multi-MB blobs as values in the KV store. I believe that leveldb internally will store anything larger than 64K as a separate file.

What stripe width or stripe cell size does Colossus use? (Is it the same for SSD and HDD D servers?)

@xiang90 @google on Colossus we did both client-side and server-side reconstructions. Waiting for the server when accessing via the client would result in unacceptable latency. There are also some common situations where an optimizing data layout would result in no additional bandwidth to reconstruct (e.g. when scanning a file a "stripe" at a time). The encoding matrix is just a binary string of length k * m. We'd store that with the file metadata so the server could reconstruct at its leisure.


How did you handle small writes?
