Velero: EBS VolumeSnapshotter fails to create volume from snapshot if the snapshot is still pending

Created on 23 Aug 2019 · 14 comments · Source: vmware-tanzu/velero

What steps did you take and what happened:
Restoring a volume with an EBS snapshot sometimes fails if the snapshot is still pending. Most of the time I don't hit it, but it happened at least once. The case where it's likely to come up is where the backup is happening from one cluster and the restore is happening into another cluster immediately following, since there is no enforced downtime around namespace/volume deletion.

What did you expect to happen:
I expected the volume creation to succeed.

The output of the following commands will help us better understand what's going on:
Error message from restore: error executing PVAction for persistentvolumes/pvc-89e0cd19-c53c-11e9-9f8d-0a0e2c8f9874: rpc error: code = Unknown desc = IncorrectState: Snapshot is in invalid state - pending

One possible solution is for the aws volume_snapshotter to check for the snapshot state before moving on. This actually solves two potential errors:
1) if the volume state is 'error', then we can just return an error without attempting to create
2) if the volume state is 'pending', then wait and try again

Here's an idea of what I'm suggesting. As a first approximation, just make a Sleep call and try again. The real answer would need to include a timeout and possibly a better way of waiting, but this gets the concept down.

Basically replace the lines here:
https://github.com/heptio/velero/blob/master/pkg/cloudprovider/aws/volume_snapshotter.go#L104-L106
With something like this (timeout, error checking, etc. omitted to just show the concept):

    if count := len(snapRes.Snapshots); count != 1 {
        return "", errors.Errorf("expected 1 snapshot from DescribeSnapshots for %s, got %v", snapshotID, count)
    }
    // NB: State is a *string in the AWS SDK, so it must be dereferenced;
    // the error from DescribeSnapshots is ignored here for brevity.
    for *snapRes.Snapshots[0].State == "pending" {
        time.Sleep(time.Second * 5)
        snapRes, _ = b.ec2.DescribeSnapshots(snapReq)
    }
    if *snapRes.Snapshots[0].State == "error" {
        return "", errors.Errorf("snapshot %s is in 'error' state", snapshotID)
    }
Labels: Bug, Needs Product, P2 - Long-term important

All 14 comments

@sseago's suggestion sgtm, but an alternative way might be to have the controller check at the start of the restore if the snapshots are ready. If they aren't, then the restore can be requeued (with some backoff) and retried later.

Prioritized Backlog

I'm thinking that we may need to add another element to the VolumeSnapshotLocation CRD here. While we can set a reasonable default limit for how long to wait for pending snapshots -- something between 10 and 30 minutes -- large volumes could take much longer than this. If a user is restoring a 1TB volume, for example, it could take several hours for the snapshot to hit the ready state. We need a way for a user who has larger-than-normal volumes (or who has previously had restores fail due to timeouts on large volumes) to specify a longer timeout. Given that this is all happening in a plugin, and we're already passing in the VSL.Spec.Config struct, the obvious place to put this configuration is in the VSL itself.

Should we add a new field volumeSnapshotLocation.Spec.Config.Timeout? I'm not sure whether there's a better, more specific name -- RestoreTimeout, since we're applying it on the restore side, or CreationTimeout, since we're specifically waiting until the snapshot is fully created and ready to use. If it's empty, we'll use a default (30 minutes? 20? 10?).

For the blocking case (the short-term fix I'm working on for just our local build), this would be pretty easy to use -- just pass it in as an argument to wait.PollImmediate (added to volume_snapshotter).

For the requeuing solution, it would be trickier to apply. Assuming the solution here is to add a new method to the plugin API that's called prior to starting the restore -- readyToRestore or something like that -- this method would need to figure out whether to return "yes, ready", "no, requeue", or "don't bother; fail the migration". Basically a (bool, err) return value: in the error case we'd return either "Snapshot %s is in an error state" (or not found, or whatever) or "Snapshot timeout reached"; with no error, true would mean "ready to restore" and false would mean "requeue".

What's less obvious is how to apply the timeout in this situation. Perhaps the first time the plugin method is called for a given restore would set the initial time, and subsequent calls would compare the elapsed time against the VSL timeout.

Also, in the requeue case, the method would have to look at all snapshots for a given restore, while in the blocking case we're looking at one snapshot at a time. This means the actual meaning of the timeout would be somewhat different in each case: in the requeue case it's a single timeout for the restore, while in the blocking case it's per snapshot. But since the blocking case is a short-term fix that will go away once the upstream fix is done, I'm less worried about it.

Actually, I guess it wouldn't really require a CRD change, since Spec.Config is a map. It would just be another entry in the map that the AWS plugin would look for.

Yep, that's right. I might go with something like snapshotCreationTimeout as the name. If we end up wanting something different down the road, we could always support reading both in for a while.

I do wonder if long-term, we want the status of the snapshots to be reflected in the Backup status - so the backup controller could be periodically checking the state of the snapshots after they've been triggered, and update the Backup's phase once they've been completed. That way, the restore controller would just need to ensure the backup is in a certain phase.

Handling it on the backup actually makes more sense -- really, the backup isn't complete until the snapshots are done. Would you just keep the backup InProgress until the post-backup plugin API call indicated that the volume snapshots were complete, or would there be some new state, "WaitingOnSnapshot", that moves to "Completed" once they're done? Either way, the configuration should be the same (so whatever short-term solution I put in place right now on the restore should still work with the same creation timeout configuration when we drop it in favor of whatever upstream does for the long-term fix). Also, as long as the backup doesn't go into the Completed state until after the snapshotter says it's done, our existing code that waits for backup completion before attempting a restore shouldn't require any changes.

I think a new phase that indicates we're waiting for the snapshot data to replicate makes sense. And we'll still want that "wait for snapshot completion" step to be non-blocking, so subsequent backups can proceed.

Couple related things we'll want to consider:

  • we'd still want any post hooks to run immediately after the snapshot was "created", rather than waiting until they're completed/replicated
  • for some storage platforms (typically on-prem), it may be possible to restore from a local snapshot after it's created, before the data is replicated off to some durable backend. so, we may want some provider-specific way to indicate whether the snapshot/backup is restorable yet, independent of whether it's completed replicating its data somewhere

"for some storage platforms (typically on-prem), it may be possible to restore from a local snapshot after it's created, before the data is replicated off to some durable backend. so, we may want some provider-specific way to indicate whether the snapshot/backup is restorable yet, independent of whether it's completed replicating its data somewhere"
That makes sense, although it gets tricky in terms of how to translate this into backup state. If it's (locally) restorable but not restorable from remote data, then "Completed" doesn't seem right. Perhaps a second (optional) state would be needed. So InProgress -> WaitForSnapshot -> (only where supported) LocalSnapshotCompleted -> Completed, but as you say, once InProgress changes to WaitForSnapshot, the next backup continues and it's only a background polling process that would update the backup to the next state (or two). Also, I'm assuming in the case where either the snapshot was fast enough to complete before the first post-backup API call or in cases where there are no snapshots to wait for, we'd go straight to Completed just like we do now.
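The phase flow sketched above (InProgress -> WaitForSnapshot -> optional LocalSnapshotCompleted -> Completed) could be modeled as a small state machine. The phase names and the locallyRestorable flag below are hypothetical, taken from this discussion rather than Velero's actual Phase constants:

```go
package main

import "fmt"

type backupPhase string

// Hypothetical phases from the discussion; not Velero's real API.
const (
	phaseInProgress             backupPhase = "InProgress"
	phaseWaitForSnapshot        backupPhase = "WaitForSnapshot"
	phaseLocalSnapshotCompleted backupPhase = "LocalSnapshotCompleted"
	phaseCompleted              backupPhase = "Completed"
)

// nextPhase sketches one tick of the background poller: if nothing is
// pending (no snapshots, or they finished before the first check), go
// straight to Completed; providers that support local restore pass through
// LocalSnapshotCompleted while data is still replicating.
func nextPhase(cur backupPhase, snapshotsPending, locallyRestorable bool) backupPhase {
	switch cur {
	case phaseInProgress:
		if !snapshotsPending {
			return phaseCompleted
		}
		return phaseWaitForSnapshot
	case phaseWaitForSnapshot:
		if !snapshotsPending {
			return phaseCompleted
		}
		if locallyRestorable {
			return phaseLocalSnapshotCompleted
		}
		return cur
	case phaseLocalSnapshotCompleted:
		if !snapshotsPending {
			return phaseCompleted
		}
		return cur
	}
	return cur
}

func main() {
	fmt.Println(nextPhase(phaseInProgress, true, false)) // WaitForSnapshot
}
```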

We just discovered that GCP has the same problem (for the same reasons). This isn't unexpected, and I think everything discussed above still applies. When the plugin API is updated to allow post-backup snapshot status checks, then both GCP and AWS (and probably Azure too) will need to check snapshot status, taking into account the snapshotCreationTimeout in the VSL.

CSI snapshots represent some of this via the ReadyToUse flag, which more or less is left to the storage system's meaning of "ready." On AWS and GCP, it would mean pushing to object storage, and on Ceph or vSphere it would probably mean replication to different nodes then copying outside the cluster.

From what I've seen of the CSI snapshotting implementations, they return before ReadyToUse is set to true in order to handle other operations, but the common controller won't use the snapshot for a restore/clone until the field is set.

I think keeping a way for individual storage systems to communicate their requirements makes sense, since they can operate differently. I'm not sure at the moment if that means multiple backup states, multiple fields, or something else. I do think we might lose some fidelity as we move to CSI, though, if we rely only on the ReadyToUse field.

@nrb Do you know if this actually waits with CSI snapshots? We are on GCP and we have no way to know that a backup is ready other than doing a time.Sleep(3 * time.Minute).

Does velero wait for CSI snapshots to be complete before marking the backup as complete?

@nrb Just following up again. We have patched the velero-plugin-gcp to wait for the volume snapshot to be ready. But long-term, it seems like CSI is the future.

Do you know if the CSI plugin correctly handles restoring a snapshot if it's still pending. For example:

  1. Velero makes a backup and creates a snapshot (with CSI driver)
  2. The user initiates a Restore immediately after the Backup is in the Complete state
  3. Velero starts the restore

When Velero does Step 3, does the system wait for the snapshot to be ready, or does Kubernetes handle it? Right now, for velero-plugin-gcp, it will fail because it attempts to use a snapshot that is not ready, so we added logic to wait for it.

Do we have to do that for CSI? An answer would be really appreciated!

If anyone is curious: we talked to the Google folks managing the gke-csi driver, and the CSI driver will handle this case for you. This may be relevant for you @nrb, since it means Velero doesn't need to wait for snapshots to be ready -- I guess you don't need to do anything.

Basically, the PVC provisioner will retry forever until the VolumeSnapshot is readyToUse. So in theory, the EBS and GKE volume snapshotter bugs are no longer relevant if you use the CSI snapshot.

Closing because fixed by #3533

