Currently, if Velero finds an invalid BSL, it will block operations.
Change the behavior so that Velero continues creating backups even if there are invalid BSLs, as long as there's a valid BSL for them.
Proposed changes:

- Add a runtime-controller based controller that continuously validates BSLs and marks them as available (valid) or unavailable (invalid).
- Surface this state in `velero backup-location get` (https://github.com/vmware-tanzu/velero/issues/2489).
- Update `velero status` based on the state of the BSLs (#2488).

Scenarios to handle:

- The default BSL is unavailable, but at least one other is available.
- At least one BSL is available, but no BSL matches the specified default.
- All BSLs are unavailable.

What steps did you take and what happened:
We're running into a scenario where, if the Velero pod restarts for whatever reason (edits to the Velero deployment, pod rescheduling, etc.) and some backup storage locations happen to be unreachable because the backup server is unavailable (upgrades, etc.), the Velero pod keeps crashlooping due to these invalid backup storage locations. This prevents backups to other, valid backup storage locations.
What did you expect to happen:
Velero pod should not crashloop if one BSL is invalid since this blocks backups to any other valid storage locations.
The output of the following commands will help us better understand what's going on:
(Pasting long output into a GitHub gist or other pastebin is fine.)
- `kubectl logs deployment/velero -n velero`
- `velero backup describe <backupname>` or `kubectl get backup/<backupname> -n velero -o yaml`
- `velero backup logs <backupname>`
- `velero restore describe <restorename>` or `kubectl get restore/<restorename> -n velero -o yaml`
- `velero restore logs <restorename>`

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]
Environment:
- Velero version (use `velero version`): v1.1.0
- Velero features (use `velero client config get features`):
- Kubernetes version (use `kubectl version`):
- OS (e.g. from `/etc/os-release`):

👍 I think we likely want to move to having a controller for BSLs that independently validates them and marks them as usable or not usable, vs. doing it at server startup, both for the reasons you cover and because if a new BSL is created after the server starts up, it currently doesn't really get validated; it could just cause a subsequent backup to fail if it can't be connected to.
Relevant work: https://github.com/vmware-tanzu/velero/pull/2382
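For concreteness, here is a minimal sketch of what such a controller could look like on top of controller-runtime. The `BackupStorageLocation` status field, the `Available`/`Unavailable` phase constants, and the `Validate` callback are assumptions for illustration, not the actual Velero API:

```go
// Hypothetical sketch of a BSL validation controller built on controller-runtime.
package controller

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

// validationFrequency bounds how often each BSL is re-checked.
const validationFrequency = time.Minute

// BSLReconciler validates a single BackupStorageLocation per reconcile and
// records the result on its status instead of failing server startup.
type BSLReconciler struct {
	client.Client
	// Validate is assumed to attempt a lightweight connection to the object store.
	Validate func(ctx context.Context, bsl *velerov1.BackupStorageLocation) error
}

func (r *BSLReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var bsl velerov1.BackupStorageLocation
	if err := r.Get(ctx, req.NamespacedName, &bsl); err != nil {
		// The BSL was deleted; nothing to do.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Mark the location available or unavailable; never abort the server.
	if err := r.Validate(ctx, &bsl); err != nil {
		bsl.Status.Phase = velerov1.BackupStorageLocationPhaseUnavailable // assumed constant
	} else {
		bsl.Status.Phase = velerov1.BackupStorageLocationPhaseAvailable // assumed constant
	}
	if err := r.Status().Update(ctx, &bsl); err != nil {
		return ctrl.Result{}, err
	}

	// Requeue so validation runs continuously at a bounded frequency.
	return ctrl.Result{RequeueAfter: validationFrequency}, nil
}
```

The key point of this shape is that a failed check only flips the status of that one location and requeues it; it never aborts the server, so backups to other locations keep working.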
I'm still a little unclear on what user experience we're shooting for. In my mind, the relevant scenarios here are:
- at server startup:
- at some time after server startup:
How do we want to handle these states and indicate them to users? Some possible options:
- Show it in `velero status`

My 2c on desired behavior:
I agree with your summary @skriss, with one special case:
If the current default BSL is invalid, but others are valid, how do we indicate that backups can proceed so long as they have alternate locations? Or do we rely on users simply checking the output of velero get backup-locations, seeing which are ready and/or valid, and reconfiguring the default?
I'm thinking thru this and writing it up again to make sure I nail all the details. I appreciate the input, some of this is different than what I had in mind but makes sense.
> If the current default BSL is invalid, but others are valid, how do we indicate that backups can proceed so long as they have alternate locations? Or do we rely on users simply checking the output of velero get backup-locations, seeing which are ready and/or valid, and reconfiguring the default?
Option 1)
Rely on users checking the output of `velero get backup-locations` to see which locations are ready and/or valid.

Option 2)
Add to `velero status` a listing of all BSLs, and indicate the status of each (ready/waiting).

I like option 2 better, because we can make `velero status` aggregate the most relevant info and state the user needs to know, without requiring them to read the docs to learn where to find things.
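To make option 2 concrete, here is a small, self-contained sketch of the roll-up logic a `velero status`-style view might apply. The `bslState` struct and the exact wording are hypothetical; the `--storage-location` flag mentioned in the output is the existing `velero backup create` flag for targeting a non-default location:

```go
// Sketch of aggregating per-BSL state into one user-facing summary line.
package main

import "fmt"

type bslState struct {
	Name      string
	Default   bool
	Available bool
}

// summarize rolls up the state of all locations, covering the cases discussed
// above: no available locations, default unavailable but others available,
// and default available.
func summarize(locations []bslState) string {
	availableCount := 0
	defaultAvailable := false
	for _, l := range locations {
		if l.Available {
			availableCount++
			if l.Default {
				defaultAvailable = true
			}
		}
	}
	switch {
	case availableCount == 0:
		return "no backup storage locations are available; backups will fail"
	case !defaultAvailable:
		return fmt.Sprintf("%d location(s) available, but the default is unavailable; "+
			"pass --storage-location when creating backups", availableCount)
	default:
		return fmt.Sprintf("%d location(s) available, including the default", availableCount)
	}
}

func main() {
	fmt.Println(summarize([]bslState{
		{Name: "default", Default: true, Available: false},
		{Name: "secondary", Available: true},
	}))
}
```

This also speaks to the "default is invalid but others are valid" question above: the aggregate view can tell the user that backups can still proceed if they specify an alternate location.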
Updated the description.
+1 on the above suggestions. The only concern I have is crashlooping when there's no BSL -- in scenarios where Velero is deployed without a default BSL, the expectation is Velero should not crashloop, right? I'm thinking of use cases where Velero is running as part of a larger platform where BSLs are created on-demand prior to backups being triggered.
@betta1 I would expect that as soon as a valid BSL is created, the Velero deployment would then recover. This is in line with a modern, cloud-native application design.
Could Velero's status be communicated through that system to show that it will be fine once a backup is started?
@nrb the tricky part is discerning whether the crashloop is because there's no BSL or a result of some other error condition. For example, the system auto-deploys Velero, installs the necessary plugins (object store plugins, etc.), and then verifies the installation was successful by checking that the Velero pod is up and running. If the Velero pod starts in a crashloop, it'll be difficult for the system to know whether the crashloop is caused by, for example, a plugin error vs. no BSL.
For the case where Velero is deployed without a default BSL, could we allow the pod to stay up and running instead of crashlooping, since no default BSL was configured?
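As a rough sketch of what a non-fatal startup check could look like, assuming the server keeps running and leaves ongoing validation to the BSL controller (the helper name, logger wiring, and lookup are illustrative, not the current Velero code path):

```go
// Sketch of a startup check that tolerates a missing or unreadable default BSL
// instead of exiting (and therefore crashlooping).
package server

import (
	"context"

	"github.com/sirupsen/logrus"
	"sigs.k8s.io/controller-runtime/pkg/client"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

func checkDefaultLocationAtStartup(ctx context.Context, c client.Client, namespace, defaultLocation string, log logrus.FieldLogger) {
	var bsl velerov1.BackupStorageLocation
	key := client.ObjectKey{Namespace: namespace, Name: defaultLocation}
	if err := c.Get(ctx, key, &bsl); err != nil {
		// Previously this kind of failure could abort server startup. Logging a
		// warning keeps the pod running so the BSL controller can pick the
		// location up (or mark it unavailable) once it exists or recovers.
		log.WithError(err).Warnf("default backup storage location %q not found or not readable; "+
			"continuing startup, backups to it will fail until it becomes available", defaultLocation)
		return
	}
	log.Infof("default backup storage location %q found", defaultLocation)
}
```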
Got it, that can be challenging. Let me think on this some more.
I know it's possible to inspect the pod logs, the deployment, and events for this kind of information. I'm wondering if we can emit it in a way that makes it easier for external systems to discern why, while also letting Velero lean on some of Kubernetes's features.
Agreed. Also, how will external systems use the /livez HTTP endpoint -- does it mean Velero will need to be exposed via a Service for this endpoint to be reachable by external systems that want to query the status?
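As a sketch of how the endpoints might be split, assuming liveness stays independent of BSL state and readiness reads the cached result of the validation controller (the function names and wiring are hypothetical):

```go
// Sketch of liveness/readiness endpoints: liveness only means "the server
// process is up"; readiness reports whether any BSL is currently available.
package server

import "net/http"

// RegisterHealthEndpoints wires /livez and /readyz onto the given mux.
// anyBSLAvailable is assumed to read state cached by the BSL validation
// controller, so handling a probe opens no new storage connections.
func RegisterHealthEndpoints(mux *http.ServeMux, anyBSLAvailable func() bool) {
	// Liveness: the process is running; a missing or invalid BSL must not
	// cause a restart, so it is not considered here.
	mux.HandleFunc("/livez", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: report whether at least one location can accept backups.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if anyBSLAvailable() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})
}
```

On reachability: kubelet probes hit the pod IP directly, so no Service is needed for liveness/readiness probes themselves. An external system would either need a Service (or port-forward) to reach the endpoint, or it could skip the endpoint entirely and read the BSL status fields from the API server.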
One other question I have is how BSL readiness will reconcile with the backup sync controller. Both the BSL readiness and backup sync processes will be running continuously, with each process opening new connections to the storage target, so we'll need a mechanism to reconcile them so that Velero does not inundate the storage target with too many connections.
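One possible way to bound the connection rate is to record when a location was last validated and have both controllers consult that before contacting the storage target. A tiny sketch of that check, with the last-validation field and frequency being assumptions:

```go
// Sketch of throttling how often a BSL is actually contacted, so the
// validation controller and the backup sync controller don't pile up
// connections to the same object store.
package controller

import "time"

// shouldRevalidate returns true only when the last successful check is older
// than the configured frequency, letting other controllers reuse the cached
// result instead of opening their own connections.
func shouldRevalidate(lastValidation time.Time, frequency time.Duration, now time.Time) bool {
	if lastValidation.IsZero() {
		return true // never validated yet
	}
	return now.Sub(lastValidation) >= frequency
}
```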
Based on the new feedback here and in today's community meeting, I updated the description of this issue. I'd appreciate a sanity check.
c/c @skriss @nrb @ashish-amarnath @betta1
@carlisia not sure what was updated on this ticket.
The description.