Velero: FE: Create a `velero debug` command for gathering troubleshooting information

Created on 13 Jul 2018  ·  7 Comments  ·  Source: vmware-tanzu/velero

Describe the solution you'd like
Provide an ark debug subcommand that could output the following information:

  • ark client and server versions
  • ark pod logs (should probably include restic logs, too)
  • ark config
  • If a backup/restore name is provided:
    * relevant backup or restore logs
    * backup and/or restore YAML

Additionally, the ark debug command should provide a way to filter out sensitive information like:

  • bucket names
  • secrets
  • More?

This command would make it easier for users to file bug reports and get answers in a timely manner.
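One simple approach to the filtering described above is deny-listing: replacing values the user already knows are sensitive (bucket names, secret values) rather than trying to detect secrets by pattern. A minimal sketch in Python — the `scrub` function name and its behavior are assumptions for illustration, not part of any proposed design:

```python
def scrub(text, sensitive_values, replacement="<REDACTED>"):
    """Replace every known sensitive string with a placeholder.

    Replacing user-supplied values (bucket names, secret values) is
    predictable; pattern-based detection can always miss something.
    """
    # Longest values first, so a short value doesn't break up a longer one
    for value in sorted(sensitive_values, key=len, reverse=True):
        if value:
            text = text.replace(value, replacement)
    return text

line = 'uploaded backup to bucket "prod-velero-backups"'
print(scrub(line, {"prod-velero-backups"}))
# uploaded backup to bucket "<REDACTED>"
```

Pattern-based detection could be layered on top, but known-value replacement keeps the behavior predictable, since no scrubber can guarantee it catches every secret.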

Enhancement User Epic P1 - Important Reviewed Q2 2021

All 7 comments

Great suggestion @nrb. We could also pair this command with https://github.com/heptio/ark/issues/578

Oh nice, I had missed that issue.

I definitely think linking the two would be a good idea.

I was thinking that ark debug and ark bug could be two separate things (or a set of flags on one command):

  • As an open source user, having a convenience command like ark bug to populate some info in a GitHub issue would be nice, but I wouldn't want to expose my config, logs, or sensitive data without manually editing/copying/pasting it into the issue myself.

  • As a customer paying for support, I could see ark debug generating a tarball of info that could be attached to a private helpdesk ticket. It's fine to try to scrub sensitive data, but you can't guarantee that the code won't miss a secret or two. We used a command like this to support Riak back in the day; the support team loved it.

Perhaps ark bug could include a --verbose and/or --tarball flag(s) to generate the more complete debug file.

I agree they should be 2 separate things. ark bug must never expose confidential data and is really a convenience for getting you to a GitHub issue, possibly with some information filled in. ark debug may or may not contain confidential data - we should have a sufficient number of flags to allow you to control what is included in the debug tarball.

Wanted to drop the output of Tilt's tilt doctor cmd:

❯ tilt doctor
Tilt: v0.17.12, built 2020-11-19
System: darwin-amd64
---
Docker
- Host: [default]
- Version: 1.40
- Builder: 2
---
Kubernetes
- Env: kind-0.6+
- Context: kind-development
- Cluster Name: kind-development
- Namespace: velero
- Container Runtime: containerd
- Version: v1.18.2
- Cluster Local Registry: none
---
Thanks for seeing the Tilt Doctor!
Please send the info above when filing bug reports. 💗

The info below helps us understand how you're using Tilt so we can improve,
but is not required to ask for help.
---
Analytics Settings
--> (These results reflect your personal opt in/out status and may be overridden by an `analytics_settings` call in your Tiltfile)
- User Mode: opt-in
- Machine: b01f29c71f7ed63d15c1a67509c7c06d
- Repo: Z6GQn0TgYuYG6BNNif2f/A==

I think for 1.7.0, we can start with this list:

  • Kubernetes version
  • Velero client and server versions
  • Velero pod logs, including restic
  • Velero Deployment
  • List of plugins
  • If a backup/restore name is provided:
    • relevant backup or restore logs
    • backup and/or restore YAML

These can be provided locally in a gzip or zip file as a first pass, allowing users to scrub data with their own tools. We can iterate on built-in scrubbing after that.

Ideally, this would be written for the client side, and could run against any version of Velero on the server side.
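As a rough illustration of that first pass, a client-side script could shell out to `kubectl` and `velero`, capture each command's output, and bundle the results into a gzipped tarball for the user to inspect and scrub before sharing. The exact commands, flags, and file names below are assumptions sketching the idea, not the final design:

```python
import os
import shutil
import subprocess
import tarfile
import tempfile

# Illustrative capture list mirroring the items proposed for 1.7.0;
# file names and exact commands are assumptions, not a finalized spec.
COMMANDS = {
    "kubernetes-version.txt": ["kubectl", "version"],
    "velero-version.txt": ["velero", "version"],
    "velero-pod-logs.txt": ["kubectl", "logs", "deploy/velero", "-n", "velero"],
    "velero-deployment.yaml": ["kubectl", "get", "deploy/velero", "-n", "velero", "-o", "yaml"],
    "plugins.txt": ["velero", "plugin", "get"],
}

def collect(output_file="velero-debug.tar.gz"):
    """Run each capture command, save its output to a working directory,
    and bundle everything into a gzipped tarball."""
    workdir = tempfile.mkdtemp(prefix="velero-debug-")
    for name, cmd in COMMANDS.items():
        if shutil.which(cmd[0]) is None:
            continue  # CLI not installed; skip this capture
        try:
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
        except subprocess.TimeoutExpired:
            continue  # don't let one slow command block the whole bundle
        with open(os.path.join(workdir, name), "w") as f:
            # Keep stderr too, so failed captures are visible in the bundle
            f.write(result.stdout or result.stderr)
    with tarfile.open(output_file, "w:gz") as tar:
        tar.add(workdir, arcname="debug")
    return output_file

if __name__ == "__main__":
    print(collect())
```

Because everything is written to a plain tarball, users can untar it, run their own scrubbing tools over the files, and repack before attaching it to an issue or ticket.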

After some experimentation, I think we can use https://github.com/vmware-tanzu/crash-diagnostics for this info. Here's a sample crashd script:

ns = "velero"
# Working dir for writing during script execution
crshd = crashd_config(workdir="{0}/crashd".format(os.home))
# Read the default kubeconfig, like velero
set_defaults(kube_config(path="{0}/.kube/config".format(os.home)))
capture_local(cmd="velero version")
# These need to go into functions due to Starlark limitations
# if args.backup:
    # backupLogsCmd = "velero backup logs {}".format(args.backup)
    # capture_local(cmd=backupLogsCmd)
# if args.restore:
    # restoreLogsCmd = "velero restore logs {}".format(args.restore)
    # capture_local(cmd=restoreLogsCmd)
kube_capture(what="logs", namespaces=[ns])
kube_capture(what="objects", kinds=["customresourcedefinitions"])
archive(output_file="diagnostics.tar.gz", source_paths=[crshd.workdir])

Some concerns around using crashd:

  • We'll need to figure out how to make sure Velero users can get this tool easily. Use packaging dependencies? Fetch it at runtime?
  • What do we do with offline installs?