Relates to #1288. When removing Elasticsearch I don't necessary want to remove PVC, because it will lead to a data loss.
Expected behavior from k8s is to keep PVCs:
Deleting the Pods in a StatefulSet will not delete the associated volumes. This is to ensure that you have the chance to copy data off the volume before deleting it. Deleting the PVC after the pods have left the terminating state might trigger deletion of the backing Persistent Volumes depending on the storage class and reclaim policy. You should never assume ability to access a volume after claim deletion.
It may be convenient for development, but in production there will be scenarios when users would want to remove the Elasticsearch and reinstall it agaian. Maybe there will be issues with upgrading the Opeartor or other Elasticsearch resources to new version. I am working with MongoDB Operator and there are plenty of scenarios when I would want to keep PVCs, for example when I convert sharded cluster to a replica set. And it is also expected behavior for other k8s users who worked with StatfulSets before.
Here is an open issue with the discussion in k8s repository #55045. I would suggest leaving it up to a StatefulSet or at least introduce a flag whether to remove the PVCs or not. Or maybe make a decision about removing PVCs depending on there is it is scale-down or removal of a ReplicaSet.
Since the reclaim policy is decided by the storage class, and you can specify the storage class for the PVCs, it is possible to retain volumes on scale down or delete. Can you go into more details on use cases when this is not sufficient?
There's an upstream issue here for allowing ssets to delete PVCs as well (also linked in the original issue you mentioned): https://github.com/kubernetes/kubernetes/issues/55045#issuecomment-559524385
In the documentation of StatefulSet they explicitly mention that one shouldn't rely on volumes:
You should never assume ability to access a volume after claim deletion.
When PVC gets removed and new PVC is created with same name it gets bound to a new volume, not the existing one. So in this case I cannot have Elasticsearch attached to existing data automatically. I use https://github.com/rancher/local-path-provisioner.
Using StatefulSet is an implementation detail, it has been different in the past, it could be different in the future. Deleting the volume is done on purpose because it is not a good practice to reuse an old volume when scaling down/up an Elasticsearch cluster.
I agree that it is hard to reuse a PV, even if the reclaim policy is set to Retain, but I'm not sure to understand your use cases, could you give more details about what you have in mind ? If any of them makes sense then they probably should be solved by the operator itself, not as some side effects of an implementation choice.
I don't have any real examples with elastic operator yet, because I use it for only couple of days so far. I shared an example with mongodb operator when I wanted to convert sharded cluster to replica set (or vice versa) and preserve the data. Knowing that there is a replica set under the hood I expected same standard behavior from elastic operator as well when deleted the resource in my dev environment.
Even though StatefulSet is an implementation detail, I would expect service deployed as a StatefulSet to deal with PVCs in standard way.
I do agree that it is up to operator to decide, but if such decision deviates from expected default behavior of underlying building blocks and leads to a data loss (in case if resource removal), I would expect there to be a configuration option to chose such behavior.
Sorry, accidentally closed the issue by pressing wrong button. One other reason can be to look from perspective on running Elasticsearch without any containers and orchestrators. As I cluster admin I don't expect to have my data deleted if I remove the service from the host operating system. It is up to me to decide when to delete the data or what to do with the data further on.
@andrii-litvinov thanks for bringing that up. I can see how it would be useful for scaling down to 0 replicas. At this time that is not a use case we support though (and in fact explicitly prevent). If we do implement that then we will also have to revisit this discussion for PVCs, as well as the overall downscale logic.
Well, it means that in case of some issue/bug or unexpected behavior I won't have an option to delete the Elasticsearch resource and recreate it without data loss or special handling, which is unfortunate. I know that it's an extreme scenario, but how will operator handle the case when someone accidentally removed the PVC of the first replica in StatefulSet? SatefulSet itself is not capable of recreating of PVC. Will Operator be able to handle such scenarios?
The recommended path for deleting and recreating elasticsearch resources is using snapshots. The sset operator recreates PVCs, though not perfectly
Using snapshots means copying terabytes of data, which is not instant. It means that in case if upgrade of operator 1.0-beta to newer version will require deletion and recreation of resources, some significant downtime will be required to restore the snapshot, which is not something desirable in production environment. I guess that's also not good enough argument. Well, let's see if others will come up with reasons not to delete PVCs. Maybe that's not a concern at all.
From this issue https://github.com/elastic/cloud-on-k8s/issues/2308#issuecomment-570564864 there is one more case when removing and recreating Elasticsearch resource can be useful.
We could document that if a user has deployed previous Elasticsearch resources then these resources must be deleted before upgrading if they are running on an impacted K8S version.
So keeping PVCs when removing StatefulSets will make resolution of such issues way easier instead of dealing with snapshots and trying to keep volumes and manually re-creating PVCs before SatefulSet creation.
I just ran into the very same issue described here.
In our stack, we have multiple stateful services that we handle with statefulsets and volume claim templates, one of these services is Elasticsearch. So far, we had deployed ES with home-built statefulsets (as we did and still do for all other stateful services).
The feature of statefulsets not deleting the PVCs allocated via volume claim templates has proven to be very useful, as it separates the life time of data from the live time of the pods. We often used this to completely remove an installation from the cluster but keep the PVCs around to later reinstall the same installation and retain all its data.
Now I am investigating to replace our ES statefulsets with the ES operator. So far, the operator looks very promising (e.g., scale up and in particular scale down works without the cluster even turning yellow, which is a pain to do manually). Alas, the PVC deletion "feature" might spoil using the operator for us.
So I would like to second the request of @andrii-litvinov and request @anyasabo to reopen this issue.
Could you not create an option to allow preserving and re-using PVCs? I guess that deleting the Elasticsearch cluster (where I would want the PVCs to be preserved) and scaling it down (where after successful shard reallocation I do not need the PVCs of the deleted nodes anymore and probably _would_ want them to be deleted) are currently very similar operations, so it might not be as straightforward as it seems.
Thanks folks for raising your concerns here. I think it's worth reopening and discussing options.
As you point out @jgoeres, there are 2 different mechanisms in place to remove volumes:
We can technically add new optional fields to the Elasticsearch CRD, such as:
preservePvcOnDeletion: true # default to false, covers 1)
preservePvcOnDownscale: true # defaults to false, covers 2)
If I understand your use case correctly @jgoeres, you would like to use something similar to the first option. Is that correct?
The volumeClaimTemplate section of StatefulSets is immutable, which means once we set an ownerReference there, we cannot remove it, and the other way around. Which means we need to make those settings immutable in the Elasticsearch spec. This is hard to achieve in Kubernetes. We can check this with the validation webhook, but this webhook may be disabled.
There are corner cases to consider when PVCs are kept around:
@sebgl
Thanks for reopening this. Yes, the "preservePvcOnDeletion: true" option would be what I need.
Since I already make use of the PVC preservation behavior of statefulsets as described, I am aware of the first corner case you describe - we have a section in the documentation of our helm chart that explicitly states that when one wishes to reuse existing data via the leftover PVCs, configuration changes to the stateful services are not allowed for this to work. So I personally am perfectly fine with this.
Basically I would have an installation with, say, a 5-node cluster. I delete it, the PVCs are preserved. Once I recreate it, I would keep all relevant settings (number of nodes, node types etc.), so the new pods would simply find the data from before (which _should_ be consistent) and continue were they left off.
I personally do not see a use case for the second setting preservePvcOnDownscale, as I would expect the same problems you describe in your second corner case. Only when going from x nodes to 0 and back to x again and NOT making any changes during the scale up the problem should not occur, otherwise it almost definitely will. Will ES notice if the copies of the shards it finds on the "old" PVC are out of date?
@andrii-litvinov @jgoeres we created a script that allows recreating an Elasticsearch cluster from its old Released PersistentVolumes: https://github.com/elastic/cloud-on-k8s/tree/master/hack/reattach-pv
Could you take a look at it (the README especially)? I'm wondering if that would be good enough to cover your use case.
Sorry for letting this topic sit idle for so long. I will give this script a try ASAP.
I'd like to second this request.
In our use case, we deploy Elasticsearch as part of a bigger system through a single Helm chart.
There are a number of stateful components in that system and Elasticsearch is one of them (it serves as the storage engine behind a Jaeger deployment).
All of the stateful components we use (RabbitMQ, Consul, Cassandra) follow the same practice of leaving PVCs behind on uninstall, which makes a lot of sense as we have a number of scenarios where we install and uninstall the system as part of our maintenance routines. All stateful components except Elasticsearch persist their data and that data is available the next time we deploy our system.
The CRD addition of preservePvcOnDeletion: true|false suggested by @sebgl would allow us to perform maintenance without losing all our Elasticsearch indexes, thus all our Jaeger traces.
preservePvcOnDeletion
@sebgl
When you say, add new feature, Do you means add this in yaml of output command "kubectl get crd elasticsearch" ... ? Coud you indicate the level of yaml shoud be this?
I don't find any documentation about this.
Thanks
I don't find any documentation about this.
It does not exist (yet) at the moment.
I was able to make PVCs remain after Elasticsearch is removed with the help of a KubeMod ModRule which intercepts the deployment of the PVCs created by Elasticsearch operator and removes their ownerReference before they are deployed to the cluster.
This makes the PVCs persist and their data is picked up and available in the next deployment of Elasticsearch as long as the Elasticsearch cluster's scale is not changed between deployments.
apiVersion: api.kubemod.io/v1beta1
kind: ModRule
metadata:
name: elasticsearch-pvc-modrule
spec:
type: Patch
match:
- select: '$.kind'
matchValue: PersistentVolumeClaim
- select: '$.metadata.labels["common.k8s.elastic.co/type"]'
matchValue: elasticsearch
patch:
- op: remove
path: /metadata/ownerReferences/0
https://github.com/kubernetes/enhancements/pull/2440 got merged to make it optional to automatically remove PVCs along with StatefulSets, or when StatefulSets are downscaled.
It does not include a combination that would I think be valuable to us: remove PVCs on downscale (data has been migrated away from those volumes anyway), but preserve PVCs on deletion of the entire cluster. This would allow restoring a cluster with existing data later on.
I added a comment to the (already merged) KEP: https://github.com/kubernetes/enhancements/pull/2440#issuecomment-785813242.
Most helpful comment
I'd like to second this request.
In our use case, we deploy Elasticsearch as part of a bigger system through a single Helm chart.
There are a number of stateful components in that system and Elasticsearch is one of them (it serves as the storage engine behind a Jaeger deployment).
All of the stateful components we use (RabbitMQ, Consul, Cassandra) follow the same practice of leaving PVCs behind on uninstall, which makes a lot of sense as we have a number of scenarios where we install and uninstall the system as part of our maintenance routines. All stateful components except Elasticsearch persist their data and that data is available the next time we deploy our system.
The CRD addition of
preservePvcOnDeletion: true|falsesuggested by @sebgl would allow us to perform maintenance without losing all our Elasticsearch indexes, thus all our Jaeger traces.