Right now, for restores, we log:
We follow a similar pattern for backups, except we don't have the third item (yet).
I wonder whether it would make sense to consolidate this down into just a single per-backup/per-restore log file (e.g.
We could retain the view that we currently provide in ark restore describe by adding logic to that command to parse the log file and separate out the log statements into different sections within the command output.
Interested to see if other folks think this makes sense/is useful.
See #225 for some related discussion re: backup warnings/errors.
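The describe-side parsing mentioned above could be a small pass over the consolidated log. Here's a minimal sketch, assuming the per-restore log is newline-delimited JSON with a logrus-style level field; splitByLevel is a made-up helper name, not anything in the Ark codebase:

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// splitByLevel buckets JSON log lines (e.g. logrus JSON output) by their
// "level" field, so a describe command could render separate
// Warnings/Errors sections from a single consolidated log file.
func splitByLevel(log string) map[string][]string {
	sections := map[string][]string{}
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		var entry struct {
			Level string `json:"level"`
			Msg   string `json:"msg"`
		}
		if err := json.Unmarshal(sc.Bytes(), &entry); err != nil {
			continue // skip malformed lines rather than failing the whole describe
		}
		sections[entry.Level] = append(sections[entry.Level], entry.Msg)
	}
	return sections
}

func main() {
	log := `{"level":"warning","msg":"pod busybox not restored"}
{"level":"error","msg":"PV pv-1 failed"}
{"level":"info","msg":"restore complete"}`
	sections := splitByLevel(log)
	fmt.Println(len(sections["warning"]), len(sections["error"]))
}
```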
I think this makes sense. Consolidating everything in one place makes for a better user experience, and I don't think altering the commands to parse the logs should be too onerous.
Would you move validation errors from the restore resource to the log file?
That is what I was thinking. We could add a unique log field to identify them as such if we want to retain the distinction between them and errors during/after execution.
I'm not 100% convinced this is the right idea yet, so certainly interested in alternate opinions.
The "nice" (maybe?) thing about validation errors is you don't have to go to object storage to see them, as they're directly on the restore in kube.
Maybe once we have #155, we could drop validation errors entirely. We'd still need some code to check for things like invalid overlapping includes and excludes, but those could just be "errors".
^^ yeah, I was thinking along the same lines re: schema validation vs. logical validation.
And we could have our phases be New, InProgress, Failed (we couldn't start at all), Processed (maybe has errors and/or warnings, but we did some work and we've done as much as we could with it).
@jbeda this is an item we'd like to discuss soon
My read on this would be that this is an internal or developer focused improvement. Would this add any value for end-users?
I'd also like to raise a discussion on prioritization for this. @ncdc I see you made it P1 - would love to know more, or we can discuss in the context of the larger 1.0 roadmap planning.
Would this add any value for end-users?
Currently, end users asking for help have to look in the 3 locations specified to see the full picture of what happened if a backup or restore didn't work as expected, so I'd say it does have end user impact.
Ah - I understand now. Thanks @nrb. So I'm lumping this under UX / debuggability.
Related to #305
Summarizing discussion from above, here's my straw-man proposal:
- The phase set becomes:
  - New: the object has not yet been processed by the Ark server
  - Processing: the object is currently being processed by the Ark server
  - Processed: the object has been processed by the Ark server and has reached a terminal state (e.g. for backups, everything that could be backed up was backed up and the tarball etc. have been uploaded to object storage; for restores, everything that could be restored was restored and the log has been uploaded to object storage)
  - Failed: there was a fatal error preventing the object from reaching a normal terminal state (e.g. there was an error uploading the tarball/logs to object storage)
- WarningCount and ErrorCount fields are added to the status for both Backups and Restores to store counts of the warnings/errors logged during the backup/restore; backups and restores can then be displayed in ark backup get / ark restore get as e.g. Processed (2 warnings, 1 error)
- ark backup describe and ark restore describe display warning/error counts by default. If --details is specified, the per-item log is fetched from backup storage, and the warning/error lines are displayed as part of the output.
- Validation errors (status.validationErrors) go away; any information currently reported as a validation error gets logged at error level to the per-item log file, and instead of using the FailedValidation phase, objects will be Processed with >0 errors.
- A FailureReason field is added to the status for both Backups and Restores, to tell the user why their backup/restore is Failed rather than Processed (this is particularly useful for when the log fails to upload, since the only other place that errors can be logged currently is the server log)

All comments welcome -- just wanted to get a draft proposal out!
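The status additions in the proposal above could look something like this. Field names come from the proposal; the types, JSON tags, and the phaseSummary helper are guesses for illustration, not the shipped API:

```go
package main

import "fmt"

// BackupStatus sketches the proposed additions to the status block for
// Backups (Restores would mirror it).
type BackupStatus struct {
	Phase         string `json:"phase"` // New | Processing | Processed | Failed
	WarningCount  int    `json:"warningCount"`
	ErrorCount    int    `json:"errorCount"`
	FailureReason string `json:"failureReason,omitempty"`
}

// plural renders "1 error" vs "2 errors".
func plural(n int, unit string) string {
	if n == 1 {
		return fmt.Sprintf("%d %s", n, unit)
	}
	return fmt.Sprintf("%d %ss", n, unit)
}

// phaseSummary renders the "Processed (2 warnings, 1 error)" style line
// proposed for ark backup get / ark restore get.
func phaseSummary(s BackupStatus) string {
	if s.WarningCount == 0 && s.ErrorCount == 0 {
		return s.Phase
	}
	return fmt.Sprintf("%s (%s, %s)", s.Phase,
		plural(s.WarningCount, "warning"), plural(s.ErrorCount, "error"))
}

func main() {
	fmt.Println(phaseSummary(BackupStatus{Phase: "Processed", WarningCount: 2, ErrorCount: 1}))
}
```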
This proposal makes sense to me. I think FailureReason for cases where the failure is to upload a log makes a lot of sense in terms of visibility.
One thing I'm still thinking about is how to make it more clear that a backup that's failed validation has no data uploaded to object storage. E.g. a backup with a single validation error that therefore never backed up anything might be displayed as Processed (1 error) which could be the same as for a full-cluster backup that got one error backing up a single PV. We could:
- keep FailedValidation as a phase
- change FailedValidation -> Failed, rather than Processed
- display something like Processed (0 items backed up, 1 error)

FWIW - was thinking about https://github.com/heptio/velero/issues/286#issuecomment-440386348 again, and I think FailedValidation should become Failed (rather than Processed).
So, after starting to dig into this, I realized that this could turn into a pretty big ugly PR if tackled wholesale. I thought about how to break it down and here's what I came up with:
For v1.0:
- Move RestoreResult (https://github.com/heptio/velero/blob/master/pkg/apis/velero/v1/restore.go#L118) out of the API package, since it's not actually part of the API, just a struct that gets JSON'ed and uploaded to object storage. This frees us up to drop it in a 1.x release.
- In velero restore describe, show warning/error counts right next to the status (so it'd show something like Phase: Completed (2 errors, 1 warning)). Possibly also update velero restore get output to combine the phase/warnings/errors columns.
- Decide on the semantics of Failed or Completed. Failed should mean "failed to start" or "failed to upload tarball", whereas Completed (with errors) would mean individual items failed to back up.

For v1.x:

- Improve velero backup describe or velero backup logs to more easily view errors, by adding filtering, grouping, formatting. Do this for backups first, since we already have leveled logs.
- Drop the RestoreResult type for restores, move warning/error reporting into the restore log, and enable them to be shown in the same way as for Backups.

For v2.0:

- Drop the ValidationErrors field on Backups/Restores (these will go into the log as regular errors), and drop the FailedValidation phase (this will become Failed).

@nrb @carlisia does this seem reasonable? I tried to focus v1.0 work on (a) making things consistent between Backups/Restores so it's more obvious for the user, and (b) setting us up for making additional changes later.
@nrb @carlisia would appreciate any input you have on the above - I'd like to get started on this soon.
I'll look at it in a little bit.
You didn't mention FailureReason, does this no longer make sense to add? It sounded like a good idea.
I like this proposal as is. I especially like rolling the FailedValidation into the Failed phase. Ultimately, I think the priority should be to have 1 phase that unequivocally means "success". If this is Processed, then we definitely should go for "FailedValidation -> Failed, rather than Processed". And I don't think it hurts if this change gets postponed to v2.0.
I'm confused about this Completed (2 errors, 1 warning). How does Phase: Completed relate to Phase: Processed? I found Completed in the code but I'm having trouble finding Processed. In addition, the code says BackupPhaseCompleted means "the backup has run successfully without errors", but your proposal says describe should display Phase: Completed (2 errors, 1 warning). If there are errors, how could it be Completed?
I'm asking to make sure we don't have discrepancy in meaning with our phases but also for my understanding.
Other than that, though, I like that you broke out part of this effort into future releases, that is great. 👍
For the v1.0 milestone, how does the user then get this combined stream? velero restore logs? So describe only displays counts and no other information?
Let's chat about this in person next week, I think some discussion will be helpful.
@carlisia @nrb responses to your comments. We should still discuss in person.
You didn't mention FailureReason, does this no longer make sense to add? It sounded like a good idea.
Yeah, this still makes sense, and can be added for v1.0.
For the v1.0 milestone, how does the user then get this combined stream? velero restore logs? So describe only displays counts and no other information?
Based on the latest proposal, not much changes for v1.0 in terms of seeing the results. velero restore logs plus velero restore describe would still be used to see the overall picture. The work to combine into a single stream would be done in a v1.x.
Per https://github.com/heptio/velero/issues/286#issuecomment-479014480, the main work item that I actually want to tackle for v1.0 is deciding on the semantics of Completed vs. Failed, and then making it consistent across backups + restores. Specifically:
- Does Completed (or Processed) mean "completed with no errors"? Or can it mean "completed with some errors backing up/restoring individual items"?
- Does Failed mean "there were >0 errors encountered during backup/restore"? Or does it mean "fatal error, such as unable to upload/download backup tarball"?

We could also decide to go with Completed, CompletedWithErrors, and Failed as the phase set, to differentiate.
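The three-phase option in the comment above can be sketched as a mapping from run outcome to phase. The constant names mirror the discussion; the semantics encoded here are one possible reading, not a settled decision:

```go
package main

import "fmt"

// Phase is a candidate terminal-phase set from the discussion.
type Phase string

const (
	PhaseCompleted           Phase = "Completed"           // finished, zero errors
	PhaseCompletedWithErrors Phase = "CompletedWithErrors" // finished, but some individual items errored
	PhaseFailed              Phase = "Failed"              // fatal: couldn't start, or couldn't upload/download the tarball
)

// finalPhase maps the outcome of a backup/restore run onto the candidate
// phase set: a fatal error always wins, then per-item errors, then clean.
func finalPhase(fatalErr bool, itemErrors int) Phase {
	switch {
	case fatalErr:
		return PhaseFailed
	case itemErrors > 0:
		return PhaseCompletedWithErrors
	default:
		return PhaseCompleted
	}
}

func main() {
	fmt.Println(finalPhase(false, 0), finalPhase(false, 2), finalPhase(true, 0))
}
```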
I like Failed as "fatal error, couldn't perform a key operation to start the restore," and CompletedWithErrors or PartialCompletion for a restore where some resources can't be restored. The overlap I see there is: if we failed to restore anything into k8s, but velero operations had no error, is that Failed, or CompletedWithErrors/PartialCompletion? If the number of artifacts restored == 0, I could see how the latter would be confusing.
For me it hinges on how useful the output of a CompletedWithErrors would be. If 100% unusable, then it should be a plain Failed. I suspect there's a spectrum and maybe we can't even detect that.