Origin: Independent persistent storage for replicated pods

Created on 17 Aug 2015 · 51 comments · Source: openshift/origin

Trying to add persistent storage to our MongoDB replication example template, we hit a show stopper: if a DeploymentConfig's pod template has a PersistentVolumeClaim, the same claim is reused for every replica deployed.

We need a way to provide persistent storage to a pod that scales with oc scale, so that new replicas get new PVCs and claim different PVs.

The workaround for now is to manually define:

  • _N_ pods or _N_ replication controllers with 1 replica (cannot have #replicas > 1)
  • _N_ PVCs

... and forget about oc scale.
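A minimal sketch of that workaround, with illustrative names, image, and sizes: one PVC plus one single-replica replication controller per member, duplicated N times with different suffixes.

```yaml
# One of N claim/controller pairs; repeat with -2, -3, ... for each member.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mongodb-data-1
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: ReplicationController
metadata:
  name: mongodb-1
spec:
  replicas: 1          # must stay 1; scaling up would reuse the same claim
  selector:
    name: mongodb-1
  template:
    metadata:
      labels:
        name: mongodb-1
    spec:
      containers:
      - name: mongodb
        image: centos/mongodb-36-centos7   # illustrative image
        volumeMounts:
        - name: data
          mountPath: /var/lib/mongodb/data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: mongodb-data-1
```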

/cc @bparees

area/usability component/storage kind/feature lifecycle/rotten priority/P2

Most helpful comment

@pmorie in the case of a MongoDB replica set, each member (a container in a pod) should have its own independent storage.

All 51 comments

@pmorie @markturansky I think you've already been involved in this discussion, can you tell us the current plans around this?

https://github.com/kubernetes/kubernetes/issues/260

Scaling storage is on the radar and the above issue talks through many of the problems and difficulties. This feature is growing in importance and we're starting to dissect the requirements on the linked issue.

No official design yet or implementation plan, but it is in the works.

@rhcarvalho Can you say a little more about your use-case so that we're all on the same page? It's not clear to me currently whether you want:

  1. Each replica to use the same storage
  2. Each replica to make a new claim

@pmorie in the case of a MongoDB replica set, each member (a container in a pod) should have its own independent storage.

+1

+1


Yes, but it's independent LOCAL storage, not network storage.

To the best of my knowledge, in a non-containerized (traditional) MongoDB environment, you don't use network storage for individual MongoDB instances. Why are we trying to do so just because Mongo is running in containers? Shouldn't you just be specifying a HostPath (Host/Local Storage) Volume Plugin in your MongoDB RC and problem solved?

Kubernetes/OpenShift exposes two methods for containers to access persistent storage:

  • Volume Plugins (Direct)
  • Persistent Volumes (Abstracted)

Generally, the usage of Persistent Volumes means you don't care where the pod/container/app runs but you want to be able to re-connect it to the same _network storage device_ regardless of which host it gets moved to. Most scale out persistence platforms (GlusterFS, HDFS, Mongo, Cassandra) are designed to use local direct attached storage for performance reasons, and so I think for these types of platforms you want to always be using HostPath (or some future incarnation of the same feature) rather than Persistent Volumes.
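A sketch of the HostPath approach being suggested; the path and image are illustrative, and the rescheduling and collision caveats raised later in the thread still apply.

```yaml
# Hypothetical pod template fragment: mount node-local storage directly
# via a hostPath volume instead of claiming a network-backed PV.
spec:
  containers:
  - name: mongodb
    image: centos/mongodb-36-centos7   # illustrative image
    volumeMounts:
    - name: data
      mountPath: /var/lib/mongodb/data
  volumes:
  - name: data
    hostPath:
      path: /var/lib/mongodb-data     # node-local directory
```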

@wattsteve thanks for weighing in.

Why are we trying to do so just because Mongo is running in containers? Shouldn't you just be specifying a HostPath (Host/Local Storage) Volume Plugin in your MongoDB RC and problem solved?

Relying on HostPath has several limitations as well. The obvious one: if your pod gets scheduled to a different node than the one it resided on before, you lose access to existing data. HostPath is not a production-oriented feature, as warned in the documentation of OpenShift and Kubernetes, and therefore does not solve the problem.

When running a MongoDB cluster with distributed containers, the containers are ephemeral and treated as cattle, as opposed to a traditional setup in which you use the local disks (best performance), run data backups, and keep those specific disks healthy.

Perhaps we can find a solution somewhere between the extremes. What can be done today is to run primarily on ephemeral, faster storage, with an automatic live backup to persistent storage.

That could work for a MongoDB setup, but doesn't invalidate the general need for being able to request independent PVs stated in the issue.

"if your pod gets scheduled to a different node than the one it resided before, you lose access to existing data. HostPath is not a production-oriented feature, as warned in the documentation of OpenShift and Kubernetes, and therefore does not solve the problem."

I contend this is actually not an issue, as Scale Out Software Defined Storage Platforms are designed from the ground up to expect this failure domain, which is why they offer data replication policies. A pod being moved from one server to another is the same scenario as losing a server in a non-containerized solution. When the original pod goes offline, the storage platform (Mongo) identifies that the number of Mongo replicas is affected. When the new pod is started up on a different host to replace the pod that went down, it is perceived as a new addition to the Mongo cluster, the data replicas are rebalanced, and the new replicas are stored on the new host system's local storage. This is exactly how people run HDFS in EC2 with ephemeral local disks, although that is more like using EmptyDir than HostPath.

Another point I want to make is that Storage/Persistence Platforms are Pets, not Cattle. Once you put data in something, you care about it and generally want to manage it carefully. To this point, I'd contend that not using Replication Controllers for deploying MongoDB, and instead using individual Mongo pods with NodeSelectors and HostPath (or EmptyDir) volumes, is a reasonable approach - @deanpeterson is this something you've explored?

Shouldn't you just be specifying a HostPath (Host/Local Storage) Volume Plugin in your MongoDB RC and problem solved?

@wattsteve Aside from the valid issues @rhcarvalho listed, how does that work if two pods from the same RC get scheduled on the same node? They're going to use the same hostPath and clobber each other. So I need to manually define a unique hostPath for each pod I create? That seems very error-prone.

Aside from this specific use case, it's the responsibility of the PaaS to provide storage to pods. Using EmptyDir or HostPath handcuffs the admin in terms of where that storage can come from. Even if I don't want the data to persist or follow my pods, I should be able to have a pod dynamically pick up network storage from a pool, and each pod in an RC should be able to get its own storage.

To this point, I'd contend that not using Replication Controllers for deploying MongoDB and instead using individual Mongo Pods with NodeSelectors and HostPath (or EmptyDir) volumes is a reasonable approach

That removes the entire value of having an RC, which makes it easy for me to scale my Mongo cluster up and down. Now I have to manually create/destroy pods to scale my cluster?

The upstream suggestion for this is to include a PersistentVolumeClaimTemplate on the RC side-by-side with the PodTemplate. This way, each pod replica gets its own volume.

The details are not more fully formed than that. I don't know, for example, what happens if a pod goes away (e.g., replica count decreased). Is the claim cascade-deleted? If it sticks around, does it get reused when the RC replica count increases again? These might just be policy issues with a toggle on the RC/claim WRT behavior.

It seems relatively easy to give volumes to pods in the same cardinal order in which they were created. I.e., an RC has created 7 pods, then decreases replicas to 3 (assume no PVC deletes). As the replica count returns to 7, each index (5, 6, and 7) could use the same claim it used previously.
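For reference, this is essentially the shape the idea eventually took in the Kubernetes StatefulSet API (the successor of the PetSet proposal discussed later in this thread): volumeClaimTemplates sit beside the pod template, and each ordinal replica gets, and re-uses, its own claim. Names, image, and sizes here are illustrative.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongodb
spec:
  serviceName: mongodb
  replicas: 3
  selector:
    matchLabels:
      app: mongodb
  template:
    metadata:
      labels:
        app: mongodb
    spec:
      containers:
      - name: mongodb
        image: centos/mongodb-36-centos7   # illustrative image
        volumeMounts:
        - name: data
          mountPath: /var/lib/mongodb/data
  volumeClaimTemplates:
  - metadata:
      name: data        # yields claims data-mongodb-0, data-mongodb-1, ...
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
```

Scaling down leaves the claims in place, so a pod recreated at the same ordinal reattaches to its previous volume, which matches the reuse-by-index behavior described above.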

@bparees I think I addressed @rhcarvalho's concerns in my response to his comments.

WRT the additional questions you brought up. Just so you know where I am coming from: first, I acknowledge that this is an issue if you want to use RCs and have each new pod that gets spun up assigned a new NETWORK block device. What I'm disagreeing with is that people who want to run Mongo & Cassandra should be using network block devices for their node-level persistence. They normally use local disks when not using containers for performance reasons, so now that we are using containers, I'm trying to avoid making every single I/O go over the network instead of local disk for those same performance reasons.

I think what we really need to do is to augment this feature request with an RC that can schedule pods on Kube/Origin nodes that have available local storage. We will _also_ need an RC that can stamp out new pods, each attached to a new block device, for scenarios where there is no local storage available and pods are forced to use network storage (and suffer the performance consequences).

To address the issue of two MongoDB pods on the same host colliding on the hostPath path: I agree, that's a real issue when using RCs with the existing scheduling abilities. As such, it's looking like the best way to do this _for now_ is to not use RCs, and have curated pods that use NodeSelectors so they are scheduled on hosts that have the right storage available. This would avoid the two-pods-on-one-node scenario, which is also bad because of shard replication failure domains. I also contend that this approach is a reasonable workaround until we have this PR resolved, as I suspect that the majority of our community have relatively small MongoDB clusters. RCs don't add a whole lot of value when you're only scaling up incrementally (and not ever scaling down) by adding one pod per host, every 3 months or so. This is an example of an implementation of what I am proposing for GlusterFS, which suffers from the same issues with RCs that MongoDB does - https://github.com/wattsteve/glusterfs-kubernetes

@wattsteve and I spoke offline about this, and I think we've come to an agreement that while this feature request makes sense and fills a valid need, we should proceed with the replicated Mongo sample using emptyDir storage, which is generally more suitable since it will likely be more performant and aligns with the expectations of a Mongo deployer that each cluster member is basically expendable.

obviously this means the mongo replica example needs to setup replicated shards to ensure that data is not lost if a single pod fails.

it also likely means the mongo image needs to wipe out the volume contents on startup, for the same reason we need to do it in mysql: the emptyDir may not be empty if the container has restarted "in place".

@deanpeterson what are your thoughts on a clustered mongo offering based on ephemeral storage? (ie you are on your own to either backup the cluster, or ensure you have sufficient redundant replicas configured)

Since this issue is for tracking "Independent persistent storage for replicated pods", I would like to move the conversation about how to implement MongoDB replication to https://github.com/openshift/mongodb/issues/114.

The specific implementation of this is covered by the PetSet proposal in Kubernetes.

The specific implementation of this is covered by the PetSet proposal in Kubernetes.

link: https://github.com/kubernetes/kubernetes/pull/18016

+1

+1

Reassigning to @childsb

+1

I was trying to get a large elasticsearch cluster up and running and ran into this issue.

I want to create a large number of replicated elasticsearch data nodes, but which each use their own persistent storage. The idea that a cluster or even instance restart could lead to permanent data loss is too scary... not going to go there. Also, in our particular case we are not heavily I/O bound, so the case for using local disks is not quite as persuasive.

I took a look at PetSets and they look like they could solve the problem, but as an alpha feature, they don't seem quite ready for a production environment.

@speedplane we're working on a MongoDB replication example with persistent storage using a PetSet (still a WIP at the moment):
https://github.com/sclorg/mongodb-container/pull/184

As for something you can do today without alpha features, you can use the multi deployment config / multi persistent volume claim / multi service alternative (mentioned in http://kubernetes.io/docs/user-guide/petset/#alternatives).

For your use case, you may automate the creation of a DC+PVC+SVC for each ES cluster member, and build from there.

@rhcarvalho Yes, I figured that. Just wrote a script that generates 12 nearly identical Deployment yaml files. It works, but definitely takes the elegance out of Kubernetes.

Yeah, PetSets are intended to support the "less ugly", but not quite there yet.


@smarterclayton where to send feedback on PetSets?

In particular, for a production-grade MongoDB deployment we are lacking the ability to schedule each pet on separate nodes. Adding a node selector to the pod template would give the same selector to every pod/pet.

That's what service affinity spreading (on by default) and pod affinity is for.


Looks like those don't take PetSets into account yet, just replicasets, replication controllers, and services (https://github.com/kubernetes/kubernetes/blob/master/plugin/pkg/scheduler/algorithm/priorities/selector_spreading.go#L41)

Service affinity and pod affinity should not require any of those.


That link is for the old spreaders - pod affinity / anti-affinity is the new thing.


The tried-and-true hack of assigning a host port in the pod template should still work to guarantee max of one replica per node.

But as @smarterclayton said you can also use pod affinity/anti-affinity for this (and to get much finer-grained control); see
http://kubernetes.io/docs/user-guide/node-selection/
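The anti-affinity approach can be sketched as a pod template fragment; labels and the topology key below are illustrative, and the hard (`required...`) form guarantees at most one matching pod per node.

```yaml
# Hypothetical pod template fragment: hard anti-affinity so that no two
# pods labeled app=mongodb are scheduled onto the same node.
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: mongodb
        topologyKey: kubernetes.io/hostname
```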

I have just been trying out StatefulSets (following http://blog.kubernetes.io/2017/01/running-mongodb-on-kubernetes-with-statefulsets.html), and while each replica is getting its own blob storage via dynamic provisioning, I would like them to use multiple storage accounts (this is on Azure): an issue specific to the storage account itself would affect all of the replicas. That would not happen if they could somehow select multiple storage classes/storage accounts.

@Globegitter Azure managed disk is what you need: https://github.com/kubernetes/kubernetes/pull/41950

This is amazing - thanks @weinong

I would be interested in seeing this. Has progress been made for it with openshift or kubernetes?

This will be amazing if it's done. It's true that we should not use this for the Mongo use case, but it would be great for someone who doesn't care about network I/O throughput.

I see that this is possible with StatefulSets. I have seen the ZooKeeper example where the number of replicas is 3 and the volume is defined once; I got 3 different EBS volumes for the 3 replicas in AWS. Does this mean we always need to move to StatefulSets?

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

/remove-lifecycle rotten

/remove-lifecycle stale

I have a case where I would like for worker pods to be spooled up with a user-configurable amount of storage available. The storage would not necessarily have to be persistent and could start as emptyDir, but it would need to be high capacity.

I can probably do this with emptyDir, but unfortunately, there is no way to guarantee that the size meets the needs. hostPath might work as well, but isn't always suitable for all deployment environments.

Basically, I want something like a volumeClaimTemplate, but without necessarily being tied to having a stateful set, as these aren't really stateful pods.
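One partial mitigation for the size-guarantee problem, assuming a cluster with local ephemeral-storage accounting enabled: an `emptyDir` `sizeLimit` caps usage (the pod is evicted if it exceeds the limit), and an `ephemeral-storage` resource request steers the scheduler toward nodes with capacity. Neither reserves dedicated disk, so this only approximates a real per-pod claim. Names and sizes are illustrative.

```yaml
# Hypothetical pod spec fragment for a stateless worker needing scratch space.
spec:
  containers:
  - name: worker
    image: example/worker:latest     # illustrative image
    resources:
      requests:
        ephemeral-storage: 50Gi      # scheduler hint, not a hard reservation
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    emptyDir:
      sizeLimit: 50Gi                # pod is evicted if usage exceeds this
```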

I wonder if this is being progressed upstream?

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

I'm interested in an answer to @jsight's question as well.

/remove-lifecycle stale

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale


Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-bot: Closing this issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
