Velero: Add a BSL controller to handle validation + update BSL status phase

Created on 16 Oct 2019 · 15 comments · Source: vmware-tanzu/velero

Currently, if Velero finds an invalid BSL, it will block operations.

Change the behavior so that Velero continues creating backups even if there are invalid BSLs, as long as there's a valid BSL for them.

HOW:

  • [x] add a controller-runtime controller that continuously validates BSLs and marks them as available (valid) or unavailable (invalid)
  • [x] change the server behavior to no longer validate BSLs

Complementary work

  • Add a column to indicate the BSL status on velero backup-location get (https://github.com/vmware-tanzu/velero/issues/2489)
  • Have the BSL controller continuously update the server status based on the state of the BSLs (#2488)

    - Add a cmd to fetch server status (https://github.com/vmware-tanzu/velero/issues/979)

BSL controller behavior:

log a warning

  • multiple BSLs, any one of them is unavailable, but at least one other is available
  • multiple BSLs, all available, but no BSL matches the specified default

log an error

  • no BSL, OR
  • existing BSLs, all unavailable

Backup behavior

  • no change
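
The warning/error rules above reduce to a small decision function over the counts of available and unavailable BSLs. This is a sketch with hypothetical names, and it omits the "no BSL matches the specified default" warning case, which would need the default's name as an extra input.

```go
package main

import "fmt"

// bslLogLevel returns the severity the server should log for the overall
// BSL state, following the rules in the issue description. Illustrative
// sketch only; not Velero's implementation.
func bslLogLevel(available, unavailable int) string {
	total := available + unavailable
	switch {
	case total == 0 || available == 0:
		// no BSL at all, or every existing BSL is unavailable
		return "error"
	case unavailable > 0:
		// at least one BSL is down, but another is still available
		return "warning"
	default:
		return "info"
	}
}

func main() {
	fmt.Println(bslLogLevel(0, 0)) // error: no BSL
	fmt.Println(bslLogLevel(0, 2)) // error: all unavailable
	fmt.Println(bslLogLevel(1, 1)) // warning: mixed
	fmt.Println(bslLogLevel(2, 0)) // info: all available
}
```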

Original issue:

What steps did you take and what happened:
We're running into a scenario where, if the Velero pod restarts for whatever reason (edits to the Velero deployment, pod rescheduling, etc.) and some backup storage locations happen to be unreachable (e.g. because the backup server is down for upgrades), the Velero pod keeps crashlooping due to these invalid backup storage locations. This prevents backups to other, valid backup storage locations.

What did you expect to happen:
Velero pod should not crashloop if one BSL is invalid since this blocks backups to any other valid storage locations.

The output of the following commands will help us better understand what's going on:
(Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Environment:

  • Velero version (use velero version): v1.1.0
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version):
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):

All 15 comments

👍 I think we likely want to move to having a controller for BSLs that independently validates them and marks them as usable or not usable vs. doing it at server startup, both for the reasons that you cover and because if a new BSL is created after the server starts up, currently it doesn't really get validated - could just cause a subsequent backup to fail if it can't be connected to.

I'm still a little unclear on what user experience we're shooting for. In my mind, the relevant scenarios here are:

at server startup:

  • there are no BSLs
  • there are 1+ BSLs, and some are valid/some are invalid
  • there are 1+ BSLs, and all are invalid

at some time after server startup:

  • there are no BSLs
  • there are 1+ BSLs, and some are valid/some are invalid
  • there are 1+ BSLs, and all are invalid

How do we want to handle these states and indicate them to users? Some possible options:

  • crashloop the server pod (could be via our code, or via liveness probe)
  • have backups fail
  • don't process backups
  • output in velero status

My 2c on desired behavior:

  • I think it's probably fine to crashloop if there are no BSLs or they're all invalid - makes it very clear that Velero is not currently operational
  • if there are a mix of valid and invalid BSLs, then ideally we can (a) process backups for the valid BSLs; (b) fail or not process backups for the invalid BSLs; and (c) somehow indicate which BSLs are invalid, preferably as a status field on the BSL displayed via CLI.

I agree with your summary @skriss, with one special case:

If the current default BSL is invalid, but others are valid, how do we indicate that backups can proceed so long as they have alternate locations? Or do we rely on users simply checking the output of velero get backup-locations, seeing which are ready and/or valid, and reconfiguring the default?

I'm thinking thru this and writing it up again to make sure I nail all the details. I appreciate the input, some of this is different than what I had in mind but makes sense.

If the current default BSL is invalid, but others are valid, how do we indicate that backups can proceed so long as they have alternate locations? Or do we rely on users simply checking the output of velero get backup-locations, seeing which are ready and/or valid, and reconfiguring the default?

Option 1)

  • check the output of velero get backup-locations

Option 2)

  • include in the output of velero status a listing of all BSLs, and indicate the status of each (ready/waiting)

I like option 2 better, because we can make velero status aggregate the most relevant info and state the user needs to know, without requiring them to read the docs to learn where to find things.

Updated the description.

+1 on the above suggestions. The only concern I have is crashlooping when there's no BSL -- in scenarios where Velero is deployed without a default BSL, the expectation is Velero should not crashloop, right? I'm thinking of use cases where Velero is running as part of a larger platform where BSLs are created on-demand prior to backups being triggered.

@betta1 I would expect that as soon as a valid BSL is created, the Velero deployment would then recover. This is in line with a modern, cloud-native application design.

Could Velero's status be communicated through that system to show that it will be fine once a backup is started?

@nrb the tricky part is discerning whether the crashloop is because there's no BSL or is the result of some other error condition. For example, the system auto-deploys Velero and installs the necessary plugins (object store plugins, etc.) and then verifies the installation was successful by checking that the Velero pod is up and running. If the Velero pod starts in a crashloop, it'll be difficult for the system to know whether the crashloop is caused by, say, a plugin error vs. no BSL.

For the case where Velero is deployed without a default BSL, could we allow the Pod to stay up and running instead of crashlooping it since we did not configure a default BSL?

Got it, that can be challenging. Let me think on this some more.

I know it's possible to inspect the pod logs, the deployment, and events for this kind of information. I'm wondering if we can emit in a way that makes it easier for external systems to discern why, while also letting Velero lean on some of Kubernetes's features.

Agreed. Also, how will external systems use the /livez HTTP endpoint -- will Velero need to be exposed via a Service for this endpoint to be reachable by external systems querying the status?

One other question I had is how the BSL readiness check will reconcile with the backup sync controller. Both the BSL readiness and backup sync processes will run continuously, each opening new connections to the storage target, so we'll need a mechanism to reconcile this so that Velero does not inundate the storage target with too many connections.

Based on the new feedback here and in today's community meeting, I updated the description of this issue. I'd appreciate a sanity check.

c/c @skriss @nrb @ashish-amarnath @betta1

@carlisia not sure what was updated on this ticket.

The description.
