Go-ipfs: Scale data centre based IPFS nodes

Created on 4 Aug 2019  ·  17 Comments  ·  Source: ipfs/go-ipfs

I'm trying to find an easy way to scale up our hosted ipfs instances in Peergos. Many hosting providers offer object storage, which is much cheaper than a VM's attached storage. I'm aware of the in-progress S3 data store, but my understanding is that only one ipfs instance will be able to use the same S3 store. This is because not all data in the data store is content addressed, and thus there is scope for conflict. The obvious example is the pin set, which is stored mutably under the key "/local/pins" (based on my reading of the code - correct me if I'm wrong).

One solution would be to use ipfs cluster, but that introduces unnecessary overhead and cost and doesn't currently fit our needs. Ideally I'd like all our ipfs instances to be able to store blocks in the same S3 and use an actual database, say MySQL, for storing the pinset. This would allow the set of ipfs instances to logically act as one in terms of data stored and pin sets. The assumption here is that the data store has its own replication guarantees, so there is no need for duplicates.

My current reading of the code is that the pin set is hard-coded to use the datastore, rather than exposed through a pluggable interface.

Is this something that sounds interesting? @Stebalien @whyrusleeping

kind/enhancement

All 17 comments

The obvious example is the pin set, which is stored mutably under the key "/local/pins" (based on my reading of the code - correct me if I'm wrong).

We can currently use multiple datastores. You'd have to use the shared one for blocks and a non-shared one for everything else. The pin _set_ is currently stored in the blockstore (as IPLD blocks, actually) and the CID of the current pin _root_ is stored in a separate datastore location.

The tricky part is caching and GC:

  1. We'd have to add a way to configure IPFS to _not_ cache blockstore misses.
  2. We'd have to hard-disable GC.

If you _also_ need GC, this becomes a trickier problem.


We've also discussed using a database for metadata like pins. The sticking points in the past have been:

  1. The original dream was to make data storage self-hosting: all data would be stored in an IPLD data structure inside the blockstore. However, that dream is still pretty far off, so I'm now all for ditching this until we can make something like that performant.
  2. SQLite requires CGO.
  3. Switching will be a large chunk of work.

However, even if we did switch, I'm not sure I'd want to support concurrently running multiple IPFS daemons against the same pinset. That introduces a whole new level of complexity into IPFS that I'd rather not have to deal with.

How about this for a simple proposal that solves most of the problems?

1) make all mutable data stored in the datastore have its path prefixed by the node id. Then any number of ipfs nodes can trivially share the same datastore with no conflict (assuming GC is disabled). So for example the pin set root cid would be stored at "/$nodeid/local/pins"
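
As a rough illustration of that prefixing, here's a minimal sketch using the namespace wrapper from go-datastore. The node ID and key layout are just illustrative, the example is written against a recent go-datastore (where Put/Get take a context; older releases don't), and this is not how go-ipfs wires its datastore today:

```go
package main

import (
	"context"
	"fmt"

	ds "github.com/ipfs/go-datastore"
	"github.com/ipfs/go-datastore/namespace"
	dssync "github.com/ipfs/go-datastore/sync"
)

func main() {
	ctx := context.Background()

	// Stand-in for the shared S3 datastore; any go-datastore implementation works here.
	shared := dssync.MutexWrap(ds.NewMapDatastore())

	// Each node wraps the shared datastore so its mutable keys live under /$nodeid.
	nodeID := "QmNodeA" // illustrative node ID
	scoped := namespace.Wrap(shared, ds.NewKey("/"+nodeID))

	// A write to "/local/pins" through the scoped view lands at "/QmNodeA/local/pins"
	// in the shared store, so nodes can't clobber each other's pin roots.
	if err := scoped.Put(ctx, ds.NewKey("/local/pins"), []byte("pin-root-cid")); err != nil {
		panic(err)
	}

	val, _ := shared.Get(ctx, ds.NewKey("/"+nodeID+"/local/pins"))
	fmt.Printf("shared store now has /%s/local/pins = %s\n", nodeID, val)
}
```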

N.B. You wouldn't need to worry about not caching blockstore misses, so long as you don't mind a little extra intra-data-centre bandwidth if a node is asked for a block it doesn't know it has.

The setup would then be N ipfs nodes, all using the same S3 datastore. And we've unlocked the ~10X cheaper storage.

This still leaves GC unsolved though. I don't think that can be solved without invoking something global like ipfs cluster (which would actually be logical because it already knows the global pinset). In our case we definitely need GC because encrypted data has zero duplication, and a 1 byte change in plaintext => between 4 KiB and 5 MiB of GC-able blocks.

Actually, here's a fun idea I just thought of. You could approximate a generational GC if you had two distinct datastores (e.g. two buckets in S3) and a way of telling all nodes to switch between them, copying only the things they are pinning, and then just clear the other datastore entirely.
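
A rough sketch of that bucket-flip, written against a recent go-ipfs-blockstore (older releases drop the context parameter). flipGeneration is a hypothetical helper, and it assumes the caller has already enumerated the full DAG closure of its pin set:

```go
package genc

import (
	"context"

	cid "github.com/ipfs/go-cid"
	blockstore "github.com/ipfs/go-ipfs-blockstore"
)

// flipGeneration copies every pinned block from the old generation's
// blockstore into the new one. Once every node has flipped, the old bucket
// can be wiped wholesale, which is the "collection" step. pinnedClosure is
// assumed to already contain the full DAG closure of this node's pin set.
func flipGeneration(ctx context.Context, old, next blockstore.Blockstore, pinnedClosure []cid.Cid) error {
	for _, c := range pinnedClosure {
		blk, err := old.Get(ctx, c)
		if err != nil {
			return err
		}
		if err := next.Put(ctx, blk); err != nil {
			return err
		}
	}
	return nil
}
```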

make all mutable data stored in the datastore have its path prefixed by the node id. Then any number of ipfs nodes can trivially share the same datastore with no conflict (assuming GC is disabled). So for example the pin set root cid would be stored at "/$nodeid/local/pins"

At the moment, we have (effectively) the inverse: all blocks are stored under /blocks. You can configure IPFS to use a separate blockstore for /blocks than for the rest of the datastore. We actually do this by default: /blocks uses flatfs while everything else uses leveldb.
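
For concreteness, the split described above corresponds to the mount datastore in go-datastore. Here is a hand-wired sketch with in-memory placeholders; go-ipfs builds the equivalent structure from the Datastore.Spec config rather than from code like this:

```go
package main

import (
	ds "github.com/ipfs/go-datastore"
	"github.com/ipfs/go-datastore/mount"
	dssync "github.com/ipfs/go-datastore/sync"
)

// newSplitDatastore routes /blocks to one child (flatfs, or S3 via the plugin)
// and everything else (pin root, keys, etc.) to another (leveldb by default).
func newSplitDatastore(blocks, rest ds.Datastore) ds.Datastore {
	return mount.New([]mount.Mount{
		{Prefix: ds.NewKey("/blocks"), Datastore: blocks},
		{Prefix: ds.NewKey("/"), Datastore: rest},
	})
}

func main() {
	// In-memory placeholders; a real deployment would plug in the S3 datastore
	// for blocks and leveldb for everything else.
	_ = newSplitDatastore(
		dssync.MutexWrap(ds.NewMapDatastore()),
		dssync.MutexWrap(ds.NewMapDatastore()),
	)
}
```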

N.B. You wouldn't need to worry about not caching blockstore misses, so long as you don't mind a little extra intra-data-centre bandwidth if a node is asked for a block it doesn't know it has.

The issue is that, as-is, we _do_ cache misses. We'd just need to add a way to turn that off.

TL;DR: As far as I know, the only missing pieces here (assuming no GC) are:

  1. The ability to turn off caching misses.
  2. Clear instructions.

Given what you need, I'd consider taking all the pieces that make up go-ipfs and building a custom tool with two daemons:

  1. A coordinator that handles GC, pins, etc.
  2. "Servers" that all run a DHT client and a bitswap service.

The servers would coordinate with the GC service.
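
To make that split concrete, here's a hypothetical pair of interfaces naming the responsibilities; nothing like this exists in go-ipfs, it's only meant to show where the boundary would sit:

```go
package coordinator

import (
	"context"

	cid "github.com/ipfs/go-cid"
)

// Coordinator owns the shared pin set and is the only component allowed to
// delete from the shared blockstore.
type Coordinator interface {
	Pin(ctx context.Context, root cid.Cid) error
	Unpin(ctx context.Context, root cid.Cid) error
	// RunGC removes everything unreachable from the pin set, skipping
	// blocks currently leased by a server.
	RunGC(ctx context.Context) error
	// Lease marks blocks a server is actively reading or writing so GC
	// won't remove them mid-operation; release ends the lease.
	Lease(ctx context.Context, cids []cid.Cid) (release func(), err error)
}

// Server is one of the N front-line nodes: it serves Bitswap and the DHT
// against the shared blockstore but defers all pin and GC decisions upward.
type Server interface {
	Serve(ctx context.Context) error
}
```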

You could also pretty easily implement concurrent GC with some tricks:

  1. When pinning, record the pin _before_ starting. This is what we call a _best effort_ pin in go-ipfs. The downside is that GC could remove a grandchild of the pin if it's missing an intermediate node but that's a very unusual case (and we can just re-download it).
  2. When adding, create a session/transaction to keep any blocks read/written within the transaction from being GCed while the transaction is active (a sketch of this follows below).

Really, you could probably reuse 90% of the existing GC/pin logic.
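
A minimal sketch of the transaction idea in point 2 above, with purely hypothetical names (go-ipfs exposes nothing like this today): the add path opens a transaction and touches every block it writes, and the GC sweep asks the guard before deleting anything.

```go
package gcguard

import (
	"sync"

	cid "github.com/ipfs/go-cid"
)

// gcGuard tracks blocks touched by in-flight add/pin transactions so a
// concurrent GC sweep can skip them. Counts are per-block reference counts.
type gcGuard struct {
	mu     sync.Mutex
	active map[cid.Cid]int
}

func newGCGuard() *gcGuard {
	return &gcGuard{active: make(map[cid.Cid]int)}
}

// gcTxn is one open add/pin transaction; every block it reads or writes is
// protected from GC until Close is called.
type gcTxn struct {
	g    *gcGuard
	seen []cid.Cid
}

func (g *gcGuard) Begin() *gcTxn { return &gcTxn{g: g} }

// Touch records a block as in use by this transaction.
func (t *gcTxn) Touch(c cid.Cid) {
	t.g.mu.Lock()
	t.g.active[c]++
	t.g.mu.Unlock()
	t.seen = append(t.seen, c)
}

// Close releases the transaction's hold; its blocks become GC-able again
// once no other open transaction references them.
func (t *gcTxn) Close() {
	t.g.mu.Lock()
	defer t.g.mu.Unlock()
	for _, c := range t.seen {
		if t.g.active[c]--; t.g.active[c] == 0 {
			delete(t.g.active, c)
		}
	}
}

// Protected is what the GC sweep asks before deleting a block.
func (g *gcGuard) Protected(c cid.Cid) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.active[c] > 0
}
```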

I think I've convinced myself that we don't need anything extra apart from the S3 data store (and transactions mentioned below). This is great because we don't have the bandwidth to maintain a fork of IPFS or a distinct ipfs-datacentre project.

The two reasons for needing ipfs cluster for our use case were:

  1. Being able to pin a tree that won't fit on a single ipfs node.
  2. Enforcing a duplication/erasure coding policy for data persistence

Both of these go away with an unbounded datastore like S3.

The nice property of having a shared S3 data store would have been that other ipfs instances could bypass the DHT lookup and retrieve immediately from S3, with zero duplication of data. I think we can achieve this anyway by short-circuiting a block get before it even reaches IPFS, if we know the "owner" of the block in Peergos parlance. Even if we don't do that, it just means that a get on the node would retrieve the block over the DHT and duplicate it in its own S3 store. But this will be cleaned up the next time that node GCs. So if a user's file went viral, all our ipfs nodes would naturally end up caching it in the usual way until the load disappeared and each of them GC'd. This scales not only to handle load hitting our webservers, but also p2p demand from nodes elsewhere.

When adding, create a session/transaction to keep any blocks read/written within the transaction from being GCed while the transaction is active.

IPFS needs transactions/sessions to not lose data even with a single IPFS node:
https://github.com/ipfs/go-ipfs/issues/3544
We've already implemented that API on our side and we just no-op it when calling ipfs, until ipfs implements it as well.

@Stebalien When you state: “However, even if we did switch, I’m not sure I’d want to support concurrently running multiple IPFS daemons against the same pinset. That introduces a whole new level of complexity into IPFS that I’d rather not have to deal with.”

is that purely from a GC / unpinning standpoint?

Or could you theoretically have multiple IPFS daemons using the same pinset if they were only adding?

@obo20

Both, for now. The assumption that the IPFS daemon owns its datastore is baked deeply into the application and sharing a datastore between multiple instances would require quite a bit of additional complexity. We'd need to handle things like distributed locking while updating the pinset.

Blocks are a special case because the same key always maps to the same value. That makes writes idempotent so we don't really need to take any locks.


On the other hand, I'd eventually like to extract all the blockstore-related stuff into a separate "data" subsystem. When and if that happens (not for a while), that subsystem would be responsible for pins, data, and GC, making it easy to replace the entire set wholesale.

@Stebalien Happy to close this now if you want?

We still need a way to disable caching to make this work.

Nothing needs to change if each ipfs node uses its own dir in the S3 bucket.

Is there a possibility of including the s3 datastore in go-ipfs itself?

Nothing needs to change if each ipfs node uses its own dir in the S3 bucket.

Sure, but I thought you wanted to share blockstores, right? Ah, I see, you don't _really_ care about that as you don't have much deduplication anyway.

Is there a possibility of including the s3 datastore in go-ipfs itself?

It needs to stay a plugin (it's massive) but I also need to fix plugin building.

We still need a way to disable caching to make this work.

Isn't that simply setting the bloom filter to zero?

It needs to stay a plugin (it's massive) but I also need to fix plugin building.

Specifically, ~6 MiB (+15%). However, I'm going to try to make it easier to pull the plugin in at compile time.

Isn't that simply setting the bloom filter to zero?

We have two caches: A bloom filter and an LRU. We need to disable both.

We have two caches: A bloom filter and an LRU. We need to disable both.

You are talking about the LRU cache in namesys, correct?

Edit: wait, no, this has nothing to do with the datastore. I'm confused now.

Ah, sorry, ARC, not LRU.

I'm talking about the github.com/ipfs/go-ipfs-blockstore.CachedBlockstore. If you turn the Bloom filter down to 0, you'll still get the ARC cache and there's currently no option to disable it.

Take a look at Storage in core/node/groups.go. You can see how we configure the bloom filter size but not the ARC cache size (also a cache option).
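
For anyone following along, here's roughly what constructing the blockstore without either cache looks like when using go-ipfs-blockstore directly (signatures from a recent version of the library; the missing piece in go-ipfs is just the config plumbing to pass a zero ARC size through):

```go
package main

import (
	"context"

	ds "github.com/ipfs/go-datastore"
	dssync "github.com/ipfs/go-datastore/sync"
	blockstore "github.com/ipfs/go-ipfs-blockstore"
)

func main() {
	ctx := context.Background()

	// Stand-in for the shared S3-backed datastore.
	raw := blockstore.NewBlockstore(dssync.MutexWrap(ds.NewMapDatastore()))

	// With both cache sizes at zero, CachedBlockstore should skip the bloom
	// filter and the ARC cache, so Has/Get always hit the backing datastore
	// instead of a possibly-stale local cache.
	opts := blockstore.CacheOpts{
		HasBloomFilterSize:   0,
		HasBloomFilterHashes: 0,
		HasARCCacheSize:      0,
	}
	bs, err := blockstore.CachedBlockstore(ctx, raw, opts)
	if err != nil {
		panic(err)
	}
	_ = bs
}
```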

Following on from this, we are now well and truly down the S3 blockstore route. We have our own implementation of GC that acts directly on the blockstore outside of ipfs (and we manage our own pinset). One thing that makes me nervous is that if ipfs isn't aware of any pins and ever tries to do a GC, it will delete everything. Is there a way to hard-disable GC?

GC won't happen if you haven't enabled it and you don't call ipfs repo gc. However, there's no "don't gc ever" flag. Want to add a config option (DisableGC)? When enabled, ipfs would refuse to garbage collect.
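
For what it's worth, the guard for such an option would be tiny. A hypothetical sketch (neither the DisableGC field nor this wrapper exists in go-ipfs):

```go
package gcgate

import "errors"

// Config is a stand-in for the proposed option; go-ipfs has no DisableGC
// setting today.
type Config struct {
	DisableGC bool
}

// MaybeGC wraps whatever actually performs the sweep (a manual "ipfs repo gc"
// or the --enable-gc background loop) and refuses to run when GC is disabled.
func MaybeGC(cfg Config, run func() error) error {
	if cfg.DisableGC {
		return errors.New("garbage collection is disabled by config (DisableGC)")
	}
	return run()
}
```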
