Cloud-on-k8s: Support dynamic PVC resize

Created on 29 Jan 2019  路  9Comments  路  Source: elastic/cloud-on-k8s

If the storageClass supports it, we should be able to resize a PVC at runtime, without the need to create a new pod with a new larger PVC.

Edit: we decided to workaround the StatefulSet limitations and implement pvc resize ourselves. See this post.


For users landing there because they want to resize their existing volumes: the current workaround is to rename the nodeSet to resize and give it new storage reqs. This triggers data to be copied over to new volumes by ECK.

For example, if we have:

spec:
  nodeSets:
  - name: default
    count: 3
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 5Gi
        storageClassName: standard

And we want to resize from 5GB to 10GB, we can do the following:

  • Rename the nodeSet default to default2
  • Increase the storage request
  • Apply the yaml manifest
spec:
  nodeSets:
  - name: default2
    count: 3
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
        storageClassName: standard

ECK will create Pods for the new nodeSet default2 with the requested storage size (10Gi). Then, it will migrate data away from Pods in default. Once data migration is over, all Pods from default are removed automatically. Migrating data may take some time depending on how large it is.

>feature

All 9 comments

This gets a bit trickier with the StatefulSets implementation, since we cannot edit the VolumeClaim section of the StatefulSets spec.

What we can technically do is:

  • resize the PVC/PV
  • (maybe) restart the pod (not sure that's required, to be checked)

But any new replica added to the StatefulSet would still use the initial volume size.
This could be handled by the StatefulSet controller at some point, we may not want to rush into our own hacky implementation.

A comment near the end of the issue mentioned a potential workaround to still use stateful sets and update PVCs:
https://github.com/kubernetes/kubernetes/issues/68737#issuecomment-498470138

Directly changed the StatefulSet pvc's size on the pvc itself
Removed StatefulSet using: kubectl delete sts --cascade=false [statefulset-name] (_sabo note: this is propagationpolicy: orphan in the golang sdk_)
Re-created the StatefulSet with the correct size on the claim

Though it's still a little spooky and there's plenty of room for error. And if the storage class doesn't support online resizing it becomes somewhat complicated. But it's nice that there is a workaround if we need it before ssets support resizing.

Status update: looks like the KEP PR for resizing stateful set PVCs has been stalled since September 2019:
https://github.com/kubernetes/enhancements/pull/660

And I didn't see any others.

I did a few experiments on how PVC expansion currently works and how we could work around VolumeClaimTemplates immutability in StatefulSets.

The general idea is as follow:

  • relax the validation constraint that prevents any change on the volume claim templates, to allow increasing the storage size, if the specified storage class has allowVolumeExpansion: true
  • do not allow storage size decrease though, since it is not supported at all (and rejected in the pvc validation webhook)
  • as part of the Elasticsearch controller:

    • if the specified storage request does not match the one specified in the existing StatefulSet, consider patching the existing PVCs, but keep the existing statefulset volume claim templates

    • for troubleshooting/support, annotate the StatefulSet with the resized claim size so we keep track of it

    • update all existing pvcs for that StatefulSet with the new required storage req

    • in turn, the volume provisioner handles pvc and filesystem resizing. It can be done online (no pod restart) in many cases (I tried TopoLVM and GKE default volumes). If not, we might need to restart the Pods ourselves as part of the reconciliation (similar to what we'd do for a normal rolling upgrade).

Heres is a draft PR to try it out: https://github.com/elastic/cloud-on-k8s/pull/3702.

I think there are a few discussion points regarding that possible implementation:

  • Eventually this feature could be implemented in the StatefulSet controller directly (could be several years out, though). So this is only a stop-gap until we have the "real" implementation available. Is it worth the trouble? Once that's the case, we may still want to rely on our implementation for older k8s versions which do not support it yet.
  • Once that's the case, how do we know we can switch to the builtin statefulset resize instead of our own fix? Probably by checking the k8s version. Even without this fix we'd still have to do it anyway.
  • The statefulset VolumeClaimTemplates is still immutable: when scaling up a nodeSet that had a resize applied previously, it means a PVC will first be created with the old initial size, then be resized with the new one.
  • By inspecting the storage class allowVolumeExpansion field, we know whether we can resize a volume or not. Is there any benefit in having an additional feature flag somewhere to enable this: I'm thinking about spec.nodeSets[*].resizePvc: true in the Elasticsearch spec?
  • In some cases it's not necessary to restart the Pods after their PVC has been resized. In some others it is. The PVC status gets updated with Waiting for user to (re-)start a pod to finish file system resize of volume on node in both cases. I thought we could just act based on that status (include the Pod in our regular rolling upgrade path), but this status is displayed even when online resizes do work: it just disappears automatically after a few seconds. We could either:

    • do nothing and consider it's the user responsibility to delete the Pods if necessary

    • detect cases where the pvc status has been there for a while (eg. more than X minutes) - I'd personally vote against that one

    • add an additional spec.nodeSets[*].restartPodsOnPvcResize: true field in the Elasticsearch spec so we know whether we're supposed to restart a Pod or not. If set to true, we can just include the Pod in the list of Pods part of a rolling upgrade.

  • We cannot know in advance whether the resize will work. For example a user could request a resize from 1GB to 1PB, which may end up failing at the volume driver level. This can be reflected in the PVC status. At that point though the PVC spec is still set to the invalid size, and it is not possible to set it back to a lower value (eg. 1GB).
  • It may be misleading to have the resized storage size in the volumeClaimTemplates section of the Elasticsearch resource not match the volumeClaimTemplates of the StatefulSet resource. One way to keep this in sync would be to add an additional resizePvc: 20Gi in the Elasticsearch spec. So instead of:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample
spec:
  version: 7.8.1
  nodeSets:
  - name: default
    count: 1
    volumeClaimTemplates:
      - metadata:
          name: elasticsearch-data
        spec:
          resources:
            requests:
              storage: 20Gi  # was initially 10Gi (and still is in the statefulset), but got resized
          storageClassName: standard

we could also consider:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample
spec:
  version: 7.8.1
  nodeSets:
  - name: default
    count: 1
    pvcResize:
        elasticsearch-data: 20Gi  # resized value, made explicit
    volumeClaimTemplates:
      - metadata:
          name: elasticsearch-data
        spec:
          resources:
            requests:
              storage: 10Gi  # initial value, immutable
          storageClassName: standard

It looks like the TiDB operator followed the same general idea: https://github.com/pingcap/tidb-operator/pull/3096.

This sounds awesome! Did you try out the idea mentioned in https://github.com/elastic/cloud-on-k8s/issues/325#issuecomment-518862830 to delete the StatefulSet with cascade=false to align it with the resize?

This sounds awesome! Did you try out the idea mentioned in #325 (comment) to delete the StatefulSet with cascade=false to align it with the resize?

Yes! I just tried it: it seems to work fine. It correctly remove the StatefulSet resource but keep the Pods intact. The ownerRef from the Pods is removed. I think --cascade false translate to setting a DeletionOption PropagationPolicy to Orphan.
Then ECK automatically re-creates the StatefulSet and there seems to be no interruption anywhere.

Implementing this would nicely remove my concern above:

The statefulset VolumeClaimTemplates is still immutable: when scaling up a nodeSet that had a resize applied previously, it means a PVC will first be created with the old initial size, then be resized with the new one.

so newer Pods are created with the right size automatically.

The time period where there's no StatefulSet anymore may be a bit concerning: what if ECK crashes for whatever reason at this point? I think it's probably ok to take the risk and consider this a corner case unlikely to happen.

I am in favor of documenting it with the caveat for the corner case, and how customers might recover if this happens.

The time period where there's no StatefulSet anymore may be a bit concerning: what if ECK crashes for whatever reason at this point? I think it's probably ok to take the risk and consider this a corner case unlikely to happen.

I'll track the current work status in this comment:

Was this page helpful?
0 / 5 - 0 ratings

Related issues

sebgl picture sebgl  路  5Comments

barkbay picture barkbay  路  4Comments

Pandoraemon picture Pandoraemon  路  5Comments

thbkrkr picture thbkrkr  路  5Comments

anyasabo picture anyasabo  路  3Comments