If the storageClass supports it, we should be able to resize a PVC at runtime, without the need to create a new pod with a new larger PVC.
Edit: we decided to workaround the StatefulSet limitations and implement pvc resize ourselves. See this post.
For users landing there because they want to resize their existing volumes: the current workaround is to rename the nodeSet to resize and give it new storage reqs. This triggers data to be copied over to new volumes by ECK.
For example, if we have:
spec:
nodeSets:
- name: default
count: 3
volumeClaimTemplates:
- metadata:
name: elasticsearch-data
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 5Gi
storageClassName: standard
And we want to resize from 5GB to 10GB, we can do the following:
default to default2spec:
nodeSets:
- name: default2
count: 3
volumeClaimTemplates:
- metadata:
name: elasticsearch-data
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 10Gi
storageClassName: standard
ECK will create Pods for the new nodeSet default2 with the requested storage size (10Gi). Then, it will migrate data away from Pods in default. Once data migration is over, all Pods from default are removed automatically. Migrating data may take some time depending on how large it is.
This gets a bit trickier with the StatefulSets implementation, since we cannot edit the VolumeClaim section of the StatefulSets spec.
What we can technically do is:
But any new replica added to the StatefulSet would still use the initial volume size.
This could be handled by the StatefulSet controller at some point, we may not want to rush into our own hacky implementation.
A comment near the end of the issue mentioned a potential workaround to still use stateful sets and update PVCs:
https://github.com/kubernetes/kubernetes/issues/68737#issuecomment-498470138
Directly changed the StatefulSet pvc's size on the pvc itself
Removed StatefulSet using: kubectl delete sts --cascade=false [statefulset-name] (_sabo note: this is propagationpolicy: orphan in the golang sdk_)
Re-created the StatefulSet with the correct size on the claim
Though it's still a little spooky and there's plenty of room for error. And if the storage class doesn't support online resizing it becomes somewhat complicated. But it's nice that there is a workaround if we need it before ssets support resizing.
Status update: looks like the KEP PR for resizing stateful set PVCs has been stalled since September 2019:
https://github.com/kubernetes/enhancements/pull/660
And I didn't see any others.
I did a few experiments on how PVC expansion currently works and how we could work around VolumeClaimTemplates immutability in StatefulSets.
The general idea is as follow:
allowVolumeExpansion: trueHeres is a draft PR to try it out: https://github.com/elastic/cloud-on-k8s/pull/3702.
I think there are a few discussion points regarding that possible implementation:
allowVolumeExpansion field, we know whether we can resize a volume or not. Is there any benefit in having an additional feature flag somewhere to enable this: I'm thinking about spec.nodeSets[*].resizePvc: true in the Elasticsearch spec?Waiting for user to (re-)start a pod to finish file system resize of volume on node in both cases. I thought we could just act based on that status (include the Pod in our regular rolling upgrade path), but this status is displayed even when online resizes do work: it just disappears automatically after a few seconds. We could either:spec.nodeSets[*].restartPodsOnPvcResize: true field in the Elasticsearch spec so we know whether we're supposed to restart a Pod or not. If set to true, we can just include the Pod in the list of Pods part of a rolling upgrade.volumeClaimTemplates section of the Elasticsearch resource not match the volumeClaimTemplates of the StatefulSet resource. One way to keep this in sync would be to add an additional resizePvc: 20Gi in the Elasticsearch spec. So instead of:apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
name: elasticsearch-sample
spec:
version: 7.8.1
nodeSets:
- name: default
count: 1
volumeClaimTemplates:
- metadata:
name: elasticsearch-data
spec:
resources:
requests:
storage: 20Gi # was initially 10Gi (and still is in the statefulset), but got resized
storageClassName: standard
we could also consider:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
name: elasticsearch-sample
spec:
version: 7.8.1
nodeSets:
- name: default
count: 1
pvcResize:
elasticsearch-data: 20Gi # resized value, made explicit
volumeClaimTemplates:
- metadata:
name: elasticsearch-data
spec:
resources:
requests:
storage: 10Gi # initial value, immutable
storageClassName: standard
It looks like the TiDB operator followed the same general idea: https://github.com/pingcap/tidb-operator/pull/3096.
This sounds awesome! Did you try out the idea mentioned in https://github.com/elastic/cloud-on-k8s/issues/325#issuecomment-518862830 to delete the StatefulSet with cascade=false to align it with the resize?
This sounds awesome! Did you try out the idea mentioned in #325 (comment) to delete the StatefulSet with cascade=false to align it with the resize?
Yes! I just tried it: it seems to work fine. It correctly remove the StatefulSet resource but keep the Pods intact. The ownerRef from the Pods is removed. I think --cascade false translate to setting a DeletionOption PropagationPolicy to Orphan.
Then ECK automatically re-creates the StatefulSet and there seems to be no interruption anywhere.
Implementing this would nicely remove my concern above:
The statefulset VolumeClaimTemplates is still immutable: when scaling up a nodeSet that had a resize applied previously, it means a PVC will first be created with the old initial size, then be resized with the new one.
so newer Pods are created with the right size automatically.
The time period where there's no StatefulSet anymore may be a bit concerning: what if ECK crashes for whatever reason at this point? I think it's probably ok to take the risk and consider this a corner case unlikely to happen.
I am in favor of documenting it with the caveat for the corner case, and how customers might recover if this happens.
The time period where there's no StatefulSet anymore may be a bit concerning: what if ECK crashes for whatever reason at this point? I think it's probably ok to take the risk and consider this a corner case unlikely to happen.
I'll track the current work status in this comment: