Currently, if Velero finds an invalid BSL, it will block operations.
Change the behavior so that Velero continues creating backups even if there are invalid BSLs, as long as there's a valid BSL for them.
Proposed changes:

- Add a runtime-controller based controller that continuously validates BSLs and marks them as available (valid) or unavailable (invalid).
- Surface this state in `velero backup-location get` (https://github.com/vmware-tanzu/velero/issues/2489).
- Update `velero status` based on the state of the BSLs (#2488).

Scenarios to handle:

- The default BSL is unavailable, but at least one other is available.
- At least one BSL is available, but no BSL matches the specified default.
- All BSLs are unavailable.

What steps did you take and what happened:
We're running into a scenario where, if the Velero pod restarts for whatever reason (edits to the Velero deployment, pod rescheduling, etc.) and some backup storage locations happen to be unreachable because the backup server is unavailable (upgrades, etc.), the Velero pod keeps crashlooping due to these invalid backup storage locations. This prevents backups to other, valid backup storage locations.
What did you expect to happen:
Velero pod should not crashloop if one BSL is invalid since this blocks backups to any other valid storage locations.
The output of the following commands will help us better understand what's going on:
(Pasting long output into a GitHub gist or other pastebin is fine.)
- `kubectl logs deployment/velero -n velero`
- `velero backup describe <backupname>` or `kubectl get backup/<backupname> -n velero -o yaml`
- `velero backup logs <backupname>`
- `velero restore describe <restorename>` or `kubectl get restore/<restorename> -n velero -o yaml`
- `velero restore logs <restorename>`

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]
Environment:
- Velero version (use `velero version`): v1.1.0
- Velero features (use `velero client config get features`):
- Kubernetes version (use `kubectl version`):
- OS (e.g. from `/etc/os-release`):

👍 I think we likely want to move to having a controller for BSLs that independently validates them and marks them as usable or not usable, vs. doing it at server startup, both for the reasons you cover and because if a new BSL is created after the server starts up, it currently doesn't really get validated; it could just cause a subsequent backup to fail if it can't be connected to.
Relevant work: https://github.com/vmware-tanzu/velero/pull/2382
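For concreteness, here is a minimal sketch of what such a controller could look like on top of controller-runtime. The `BackupStorageLocation` status field, the `Available`/`Unavailable` phase constants, and the `Validate` callback are assumptions for illustration, not the actual Velero API:

```go
// Hypothetical sketch of a BSL validation controller built on controller-runtime.
package controller

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

// validationFrequency bounds how often each BSL is re-checked.
const validationFrequency = time.Minute

// BSLReconciler validates a single BackupStorageLocation per reconcile and
// records the result on its status instead of failing server startup.
type BSLReconciler struct {
	client.Client
	// Validate is assumed to attempt a lightweight connection to the object store.
	Validate func(ctx context.Context, bsl *velerov1.BackupStorageLocation) error
}

func (r *BSLReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var bsl velerov1.BackupStorageLocation
	if err := r.Get(ctx, req.NamespacedName, &bsl); err != nil {
		// The BSL was deleted; nothing to do.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Mark the location available or unavailable; never abort the server.
	if err := r.Validate(ctx, &bsl); err != nil {
		bsl.Status.Phase = velerov1.BackupStorageLocationPhaseUnavailable // assumed constant
	} else {
		bsl.Status.Phase = velerov1.BackupStorageLocationPhaseAvailable // assumed constant
	}
	if err := r.Status().Update(ctx, &bsl); err != nil {
		return ctrl.Result{}, err
	}

	// Requeue so validation runs continuously at a bounded frequency.
	return ctrl.Result{RequeueAfter: validationFrequency}, nil
}
```

The key point of this shape is that a failed check only flips the status of that one location and requeues it; it never aborts the server, so backups to other locations keep working.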
I'm still a little unclear on what user experience we're shooting for. In my mind, the relevant scenarios here are:
- at server startup:
- at some time after server startup:
How do we want to handle these states and indicate them to users? Some possible options:
- Show it in `velero status`

My 2c on desired behavior:
I agree with your summary @skriss, with one special case:
If the current default BSL is invalid, but others are valid, how do we indicate that backups can proceed so long as they have alternate locations? Or do we rely on users simply checking the output of velero get backup-locations, seeing which are ready and/or valid, and reconfiguring the default?
I'm thinking thru this and writing it up again to make sure I nail all the details. I appreciate the input, some of this is different than what I had in mind but makes sense.
> If the current default BSL is invalid, but others are valid, how do we indicate that backups can proceed so long as they have alternate locations? Or do we rely on users simply checking the output of velero get backup-locations, seeing which are ready and/or valid, and reconfiguring the default?
Option 1)
Rely on users checking the output of `velero get backup-locations` to see which locations are ready and/or valid.

Option 2)
Add to `velero status` a listing of all BSLs, and indicate the status of each (ready/waiting).

I like option 2 better, because we can make `velero status` aggregate the most relevant info and state the user needs to know, without requiring them to read the docs to learn where to find things.
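To make option 2 concrete, here is a small, self-contained sketch of the roll-up logic a `velero status`-style view might apply. The `bslState` struct and the exact wording are hypothetical; the `--storage-location` flag mentioned in the output is the existing `velero backup create` flag for targeting a non-default location:

```go
// Sketch of aggregating per-BSL state into one user-facing summary line.
package main

import "fmt"

type bslState struct {
	Name      string
	Default   bool
	Available bool
}

// summarize rolls up the state of all locations, covering the cases discussed
// above: no available locations, default unavailable but others available,
// and default available.
func summarize(locations []bslState) string {
	availableCount := 0
	defaultAvailable := false
	for _, l := range locations {
		if l.Available {
			availableCount++
			if l.Default {
				defaultAvailable = true
			}
		}
	}
	switch {
	case availableCount == 0:
		return "no backup storage locations are available; backups will fail"
	case !defaultAvailable:
		return fmt.Sprintf("%d location(s) available, but the default is unavailable; "+
			"pass --storage-location when creating backups", availableCount)
	default:
		return fmt.Sprintf("%d location(s) available, including the default", availableCount)
	}
}

func main() {
	fmt.Println(summarize([]bslState{
		{Name: "default", Default: true, Available: false},
		{Name: "secondary", Available: true},
	}))
}
```

This also speaks to the "default is invalid but others are valid" question above: the aggregate view can tell the user that backups can still proceed if they specify an alternate location.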
Updated the description.
+1 on the above suggestions. The only concern I have is crashlooping when there's no BSL -- in scenarios where Velero is deployed without a default BSL, the expectation is Velero should not crashloop, right? I'm thinking of use cases where Velero is running as part of a larger platform where BSLs are created on-demand prior to backups being triggered.
@betta1 I would expect that as soon as a valid BSL is created, the Velero deployment would then recover. This is in line with a modern, cloud-native application design.
Could Velero's status be communicated through that system to show that it will be fine once a backup is started?
@nrb the tricky part is discerning whether the crashloop is because there's no BSL or a result of some other error condition. For example, the system auto-deploys Velero, installs the necessary plugins (object store plugins, etc.), and then verifies the installation was successful by checking that the Velero pod is up and running. If the Velero pod starts in a crashloop, it'll be difficult for the system to know whether the crashloop is caused by, for example, a plugin error vs. no BSL.
For the case where Velero is deployed without a default BSL, could we allow the pod to stay up and running instead of crashlooping, since no default BSL was configured?
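As a rough sketch of what a non-fatal startup check could look like, assuming the server keeps running and leaves ongoing validation to the BSL controller (the helper name, logger wiring, and lookup are illustrative, not the current Velero code path):

```go
// Sketch of a startup check that tolerates a missing or unreadable default BSL
// instead of exiting (and therefore crashlooping).
package server

import (
	"context"

	"github.com/sirupsen/logrus"
	"sigs.k8s.io/controller-runtime/pkg/client"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

func checkDefaultLocationAtStartup(ctx context.Context, c client.Client, namespace, defaultLocation string, log logrus.FieldLogger) {
	var bsl velerov1.BackupStorageLocation
	key := client.ObjectKey{Namespace: namespace, Name: defaultLocation}
	if err := c.Get(ctx, key, &bsl); err != nil {
		// Previously this kind of failure could abort server startup. Logging a
		// warning keeps the pod running so the BSL controller can pick the
		// location up (or mark it unavailable) once it exists or recovers.
		log.WithError(err).Warnf("default backup storage location %q not found or not readable; "+
			"continuing startup, backups to it will fail until it becomes available", defaultLocation)
		return
	}
	log.Infof("default backup storage location %q found", defaultLocation)
}
```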
Got it, that can be challenging. Let me think on this some more.
I know it's possible to inspect the pod logs, the deployment, and events for this kind of information. I'm wondering if we can emit it in a way that makes it easier for external systems to discern why, while also letting Velero lean on some of Kubernetes's features.
Agreed. Also, how will external systems use the /livez HTTP endpoint -- does it mean Velero will need to be exposed via a Service for this endpoint to be reachable by external systems that want to query the status?
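As a sketch of how the endpoints might be split, assuming liveness stays independent of BSL state and readiness reads the cached result of the validation controller (the function names and wiring are hypothetical):

```go
// Sketch of liveness/readiness endpoints: liveness only means "the server
// process is up"; readiness reports whether any BSL is currently available.
package server

import "net/http"

// RegisterHealthEndpoints wires /livez and /readyz onto the given mux.
// anyBSLAvailable is assumed to read state cached by the BSL validation
// controller, so handling a probe opens no new storage connections.
func RegisterHealthEndpoints(mux *http.ServeMux, anyBSLAvailable func() bool) {
	// Liveness: the process is running; a missing or invalid BSL must not
	// cause a restart, so it is not considered here.
	mux.HandleFunc("/livez", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: report whether at least one location can accept backups.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if anyBSLAvailable() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})
}
```

On reachability: kubelet probes hit the pod IP directly, so no Service is needed for liveness/readiness probes themselves. An external system would either need a Service (or port-forward) to reach the endpoint, or it could skip the endpoint entirely and read the BSL status fields from the API server.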
One other question I have is how BSL readiness will reconcile with the backup sync controller. Both the BSL readiness and backup sync processes will be running continuously, with each process opening new connections to the storage target, so we'll need a mechanism to reconcile them so that Velero does not inundate the storage target with too many connections.
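One possible way to bound the connection rate is to record when a location was last validated and have both controllers consult that before contacting the storage target. A tiny sketch of that check, with the last-validation field and frequency being assumptions:

```go
// Sketch of throttling how often a BSL is actually contacted, so the
// validation controller and the backup sync controller don't pile up
// connections to the same object store.
package controller

import "time"

// shouldRevalidate returns true only when the last successful check is older
// than the configured frequency, letting other controllers reuse the cached
// result instead of opening their own connections.
func shouldRevalidate(lastValidation time.Time, frequency time.Duration, now time.Time) bool {
	if lastValidation.IsZero() {
		return true // never validated yet
	}
	return now.Sub(lastValidation) >= frequency
}
```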
Based on the new feedback here and in today's community meeting, I updated the description of this issue. I'd appreciate a sanity check.
c/c @skriss @nrb @ashish-amarnath @betta1
@carlisia not sure what was updated on this ticket.
The description.