Right now, for restores, we log:
We follow a similar pattern for backups, except we don't have the third item (yet).
I wonder whether it would make sense to consolidate this down into just a single per-backup/per-restore log file (e.g.
We could retain the view that we currently provide in ark restore describe by adding logic to that command to parse the log file and separate out the log statements into different sections within the command output.
Interested to see if other folks think this makes sense/is useful.
See #225 for some related discussion re: backup warnings/errors.
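The describe-side parsing mentioned above could be a small pass over the consolidated log. Here's a minimal sketch, assuming the per-restore log is newline-delimited JSON with a logrus-style level field; splitByLevel is a made-up helper name, not anything in the Ark codebase:

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// splitByLevel buckets JSON log lines (e.g. logrus JSON output) by their
// "level" field, so a describe command could render separate
// Warnings/Errors sections from a single consolidated log file.
func splitByLevel(log string) map[string][]string {
	sections := map[string][]string{}
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		var entry struct {
			Level string `json:"level"`
			Msg   string `json:"msg"`
		}
		if err := json.Unmarshal(sc.Bytes(), &entry); err != nil {
			continue // skip malformed lines rather than failing the whole describe
		}
		sections[entry.Level] = append(sections[entry.Level], entry.Msg)
	}
	return sections
}

func main() {
	log := `{"level":"warning","msg":"pod busybox not restored"}
{"level":"error","msg":"PV pv-1 failed"}
{"level":"info","msg":"restore complete"}`
	sections := splitByLevel(log)
	fmt.Println(len(sections["warning"]), len(sections["error"]))
}
```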
I think this makes sense. Consolidating everything in one place makes for a better user experience, and I don't think altering the commands to parse the logs should be too onerous.
Would you move validation errors from the restore resource to the log file?
That is what I was thinking. We could add a unique log field to identify them as such if we want to retain the distinction between them and errors during/after execution.
I'm not 100% convinced this is the right idea yet, so certainly interested in alternate opinions.
The "nice" (maybe?) thing about validation errors is you don't have to go to object storage to see them, as they're directly on the restore in kube.
Maybe once we have #155, we could drop validation errors entirely. We'd still need some code to check for things like invalid overlapping includes and excludes, but those could just be "errors".
^^ yeah, I was thinking along the same lines re: schema validation vs. logical validation.
And we could have our phases be New, InProgress, Failed (we couldn't start at all), Processed (maybe has errors and/or warnings, but we did some work and we've done as much as we could with it).
@jbeda this is an item we'd like to discuss soon
My read on this would be that this is an internal or developer focused improvement. Would this add any value for end-users?
I'd also like to raise a discussion on prioritization for this. @ncdc I see you made it P1 - would love to know more, or we can discuss in the context of the larger 1.0 roadmap planning.
Would this add any value for end-users?
Currently, end users asking for help have to look in the 3 locations specified to see the full picture of what happened if a backup or restore didn't work as expected, so I'd say it does have end user impact.
Ah - I understand now. Thanks @nrb. So I'm lumping this under UX / debuggability.
Related to #305
Summarizing discussion from above, here's my straw-man proposal:
- The phase set becomes:
  - New: the object has not yet been processed by the Ark server
  - Processing: the object is currently being processed by the Ark server
  - Processed: the object has been processed by the Ark server and has reached a terminal state (e.g. for backups, everything that could be backed up was backed up and the tarball etc. have been uploaded to object storage; for restores, everything that could be restored was restored and the log has been uploaded to object storage)
  - Failed: there was a fatal error preventing the object from reaching a normal terminal state (e.g. there was an error uploading the tarball/logs to object storage)
- WarningCount and ErrorCount fields are added to the status for both Backups and Restores to store counts of the warnings/errors logged during the backup/restore; backups and restores can then be displayed in ark backup get / ark restore get as e.g. Processed (2 warnings, 1 error)
- ark backup describe and ark restore describe display warning/error counts by default. If --details is specified, the per-item log is fetched from backup storage, and the warning/error lines are displayed as part of the output.
- Validation errors (status.validationErrors) go away; any information currently reported as a validation error gets logged at error level to the per-item log file, and instead of using the FailedValidation phase, objects will be Processed with >0 errors.
- A FailureReason field is added to the status for both Backups and Restores, to tell the user why their backup/restore is Failed rather than Processed (this is particularly useful for when the log fails to upload, since the only other place that errors can be logged currently is the server log)

All comments welcome -- just wanted to get a draft proposal out!
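The status additions in the proposal above could look something like this. Field names come from the proposal; the types, JSON tags, and the phaseSummary helper are guesses for illustration, not the shipped API:

```go
package main

import "fmt"

// BackupStatus sketches the proposed additions to the status block for
// Backups (Restores would mirror it).
type BackupStatus struct {
	Phase         string `json:"phase"` // New | Processing | Processed | Failed
	WarningCount  int    `json:"warningCount"`
	ErrorCount    int    `json:"errorCount"`
	FailureReason string `json:"failureReason,omitempty"`
}

// plural renders "1 error" vs "2 errors".
func plural(n int, unit string) string {
	if n == 1 {
		return fmt.Sprintf("%d %s", n, unit)
	}
	return fmt.Sprintf("%d %ss", n, unit)
}

// phaseSummary renders the "Processed (2 warnings, 1 error)" style line
// proposed for ark backup get / ark restore get.
func phaseSummary(s BackupStatus) string {
	if s.WarningCount == 0 && s.ErrorCount == 0 {
		return s.Phase
	}
	return fmt.Sprintf("%s (%s, %s)", s.Phase,
		plural(s.WarningCount, "warning"), plural(s.ErrorCount, "error"))
}

func main() {
	fmt.Println(phaseSummary(BackupStatus{Phase: "Processed", WarningCount: 2, ErrorCount: 1}))
}
```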
This proposal makes sense to me. I think FailureReason for cases where the failure is to upload a log makes a lot of sense in terms of visibility.
One thing I'm still thinking about is how to make it more clear that a backup that's failed validation has no data uploaded to object storage. E.g. a backup with a single validation error that therefore never backed up anything might be displayed as Processed (1 error) which could be the same as for a full-cluster backup that got one error backing up a single PV. We could:
- keep FailedValidation as a phase
- change FailedValidation -> Failed, rather than Processed
- display something like Processed (0 items backed up, 1 error)

FWIW - was thinking about https://github.com/heptio/velero/issues/286#issuecomment-440386348 again, and I think FailedValidation should become Failed (rather than Processed).
So, after starting to dig into this, I realized that this could turn into a pretty big ugly PR if tackled wholesale. I thought about how to break it down and here's what I came up with:
For v1.0:
- Move RestoreResult (https://github.com/heptio/velero/blob/master/pkg/apis/velero/v1/restore.go#L118) out of the API package, since it's not actually part of the API, just a struct that gets JSON'ed and uploaded to object storage. This frees us up to drop it in a 1.x release.
- In velero restore describe, show warning/error counts right next to the status (so it'd show something like Phase: Completed (2 errors, 1 warning)). Possibly also update velero restore get output to combine the phase/warnings/errors columns.
- Decide on the semantics of Failed or Completed. Failed should mean "failed to start" or "failed to upload tarball", whereas Completed (with errors) would mean individual items failed to back up.

For v1.x:

- Improve velero backup describe or velero backup logs to more easily view errors, by adding filtering, grouping, formatting. Do this for backups first, since we already have leveled logs.
- Drop the RestoreResult type for restores, move warning/error reporting into the restore log, and enable them to be shown in the same way as for Backups.

For v2.0:

- Drop the ValidationErrors field on Backups/Restores (these will go into the log as regular errors), and drop the FailedValidation phase (this will become Failed).

@nrb @carlisia does this seem reasonable? I tried to focus v1.0 work on (a) making things consistent between Backups/Restores so it's more obvious for the user, and (b) setting us up for making additional changes later.
@nrb @carlisia would appreciate any input you have on the above - I'd like to get started on this soon.
I'll look at it in a little bit.
You didn't mention FailureReason, does this no longer make sense to add? It sounded like a good idea.
I like this proposal as is. I especially like rolling the FailedValidation into the Failed phase. Ultimately, I think the priority should be to have 1 phase that unequivocally means "success". If this is Processed, then we definitely should go for "FailedValidation -> Failed, rather than Processed". And I don't think it hurts if this change gets postponed to v2.0.
I'm confused about this Completed (2 errors, 1 warning). How does Phase: Completed relate to Phase: Processed? I found Completed in the code but I'm having trouble finding Processed. In addition, the code says BackupPhaseCompleted means "the backup has run successfully without errors", but your proposal says describe should display Phase: Completed (2 errors, 1 warning). If there are errors, how could it be Completed?
I'm asking to make sure we don't have discrepancy in meaning with our phases but also for my understanding.
Other than that, though, I like that you broke out part of this effort into future releases, that is great. 👍
For the v1.0 milestone, how does the user then get this combined stream? velero restore logs? So describe only displays counts and no other information?
Let's chat about this in person next week, I think some discussion will be helpful.
@carlisia @nrb responses to your comments. We should still discuss in person.
You didn't mention FailureReason, does this no longer make sense to add? It sounded like a good idea.
Yeah, this still makes sense, and can be added for v1.0.
For the v1.0 milestone, how does the user then get this combined stream? velero restore logs? So describe only displays counts and no other information?
Based on the latest proposal, not much changes for v1.0 in terms of seeing the results. velero restore logs plus velero restore describe would still be used to see the overall picture. The work to combine into a single stream would be done in a v1.x.
Per https://github.com/heptio/velero/issues/286#issuecomment-479014480, the main work item that I actually want to tackle for v1.0 is deciding on the semantics of Completed vs. Failed, and then making it consistent across backups + restores. Specifically:
- Does Completed (or Processed) mean "completed with no errors"? Or can it mean "completed with some errors backing up/restoring individual items"?
- Does Failed mean "there were >0 errors encountered during backup/restore"? Or does it mean "fatal error, such as unable to upload/download backup tarball"?

We could also decide to go with Completed, CompletedWithErrors, and Failed as the phase set, to differentiate.
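The three-phase option in the comment above can be sketched as a mapping from run outcome to phase. The constant names mirror the discussion; the semantics encoded here are one possible reading, not a settled decision:

```go
package main

import "fmt"

// Phase is a candidate terminal-phase set from the discussion.
type Phase string

const (
	PhaseCompleted           Phase = "Completed"           // finished, zero errors
	PhaseCompletedWithErrors Phase = "CompletedWithErrors" // finished, but some individual items errored
	PhaseFailed              Phase = "Failed"              // fatal: couldn't start, or couldn't upload/download the tarball
)

// finalPhase maps the outcome of a backup/restore run onto the candidate
// phase set: a fatal error always wins, then per-item errors, then clean.
func finalPhase(fatalErr bool, itemErrors int) Phase {
	switch {
	case fatalErr:
		return PhaseFailed
	case itemErrors > 0:
		return PhaseCompletedWithErrors
	default:
		return PhaseCompleted
	}
}

func main() {
	fmt.Println(finalPhase(false, 0), finalPhase(false, 2), finalPhase(true, 0))
}
```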
I like Failed as "fatal error, couldn't perform a key operation to start the restore," and CompletedWithErrors or PartialCompletion for a restore where some resources can't be restored. The overlap I see there is: if we failed to restore anything into k8s, but velero operations had no error, is that Failed, or CompletedWithErrors/PartialCompletion? If the number of artifacts restored == 0, I could see how the latter would be confusing.
For me it hinges on how useful the output of a CompletedWithErrors would be. If 100% unusable, then it should be a plain Failed. I suspect there's a spectrum and maybe we can't even detect that.