Velero: FE: Create a `velero debug` command for gathering troubleshooting information

Created on 13 Jul 2018  ·  7 Comments  ·  Source: vmware-tanzu/velero

Describe the solution you'd like
Provide an ark debug subcommand that could output the following information:

  • ark client and server versions
  • ark pod logs (should probably include restic logs, too)
  • ark config
  • If a backup/restore name is provided:
    * relevant backup or restore logs
    * backup and/or restore YAML

Additionally, the ark debug command should provide a way to filter out sensitive information like:

  • bucket names
  • secrets
  • More?

This command would make it easier for users to file bug reports and get answers in a timely manner.
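One simple approach to the filtering described above is deny-listing: replacing values the user already knows are sensitive (bucket names, secret values) rather than trying to detect secrets by pattern. A minimal sketch in Python — the `scrub` function name and its behavior are assumptions for illustration, not part of any proposed design:

```python
def scrub(text, sensitive_values, replacement="<REDACTED>"):
    """Replace every known sensitive string with a placeholder.

    Replacing user-supplied values (bucket names, secret values) is
    predictable; pattern-based detection can always miss something.
    """
    # Longest values first, so a short value doesn't break up a longer one
    for value in sorted(sensitive_values, key=len, reverse=True):
        if value:
            text = text.replace(value, replacement)
    return text

line = 'uploaded backup to bucket "prod-velero-backups"'
print(scrub(line, {"prod-velero-backups"}))
# uploaded backup to bucket "<REDACTED>"
```

Pattern-based detection could be layered on top, but known-value replacement keeps the behavior predictable, since no scrubber can guarantee it catches every secret.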

Enhancement User Epic P1 - Important Reviewed Q2 2021

All 7 comments

Great suggestion @nrb. We could also pair this command with https://github.com/heptio/ark/issues/578

Oh nice, I had missed that issue.

I definitely think linking the two would be a good idea.

I was thinking that ark debug and ark bug could be two separate things (or a set of flags on one command):

  • As an open source user, having a convenience command like ark bug to populate some info in a GitHub issue would be nice, but I wouldn't want to expose my config, logs, or sensitive data without manually editing/copying/pasting it into the issue myself.

  • As a customer paying for support, I could see ark debug generating a tarball of info that could be attached to a private helpdesk ticket. It's fine to try to scrub sensitive data, but you can't guarantee that the code won't miss a secret or two. We used a command like this to support Riak back in the day; the support team loved it.

Perhaps ark bug could include a --verbose and/or --tarball flag(s) to generate the more complete debug file.

I agree they should be 2 separate things. ark bug must never expose confidential data and is really a convenience for getting you to a GitHub issue, possibly with some information filled in. ark debug may or may not contain confidential data - we should have a sufficient number of flags to allow you to control what is included in the debug tarball.

Wanted to drop the output of Tilt's tilt doctor cmd:

❯ tilt doctor
Tilt: v0.17.12, built 2020-11-19
System: darwin-amd64
---
Docker
- Host: [default]
- Version: 1.40
- Builder: 2
---
Kubernetes
- Env: kind-0.6+
- Context: kind-development
- Cluster Name: kind-development
- Namespace: velero
- Container Runtime: containerd
- Version: v1.18.2
- Cluster Local Registry: none
---
Thanks for seeing the Tilt Doctor!
Please send the info above when filing bug reports. 💗

The info below helps us understand how you're using Tilt so we can improve,
but is not required to ask for help.
---
Analytics Settings
--> (These results reflect your personal opt in/out status and may be overridden by an `analytics_settings` call in your Tiltfile)
- User Mode: opt-in
- Machine: b01f29c71f7ed63d15c1a67509c7c06d
- Repo: Z6GQn0TgYuYG6BNNif2f/A==

I think for 1.7.0, we can start with this list:

  • Kubernetes version
  • Velero client and server versions
  • Velero pod logs, including restic
  • Velero Deployment
  • List of plugins
  • If a backup/restore name is provided:
    • relevant backup or restore logs
    • backup and/or restore YAML

These can be provided locally in a gzip or zip file as a first pass, allowing users to scrub data with their own tools. We can iterate on built-in scrubbing after that.

Ideally, this would be written for the client side, and could run against any version of Velero on the server side.
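As a rough illustration of that first pass, a client-side script could shell out to `kubectl` and `velero`, capture each command's output, and bundle the results into a gzipped tarball for the user to inspect and scrub before sharing. The exact commands, flags, and file names below are assumptions sketching the idea, not the final design:

```python
import os
import shutil
import subprocess
import tarfile
import tempfile

# Illustrative capture list mirroring the items proposed for 1.7.0;
# file names and exact commands are assumptions, not a finalized spec.
COMMANDS = {
    "kubernetes-version.txt": ["kubectl", "version"],
    "velero-version.txt": ["velero", "version"],
    "velero-pod-logs.txt": ["kubectl", "logs", "deploy/velero", "-n", "velero"],
    "velero-deployment.yaml": ["kubectl", "get", "deploy/velero", "-n", "velero", "-o", "yaml"],
    "plugins.txt": ["velero", "plugin", "get"],
}

def collect(output_file="velero-debug.tar.gz"):
    """Run each capture command, save its output to a working directory,
    and bundle everything into a gzipped tarball."""
    workdir = tempfile.mkdtemp(prefix="velero-debug-")
    for name, cmd in COMMANDS.items():
        if shutil.which(cmd[0]) is None:
            continue  # CLI not installed; skip this capture
        try:
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
        except subprocess.TimeoutExpired:
            continue  # don't let one slow command block the whole bundle
        with open(os.path.join(workdir, name), "w") as f:
            # Keep stderr too, so failed captures are visible in the bundle
            f.write(result.stdout or result.stderr)
    with tarfile.open(output_file, "w:gz") as tar:
        tar.add(workdir, arcname="debug")
    return output_file

if __name__ == "__main__":
    print(collect())
```

Because everything is written to a plain tarball, users can untar it, run their own scrubbing tools over the files, and repack before attaching it to an issue or ticket.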

After some experimentation, I think we can use https://github.com/vmware-tanzu/crash-diagnostics for this info. Here's a sample crashd script:

ns = "velero"
# Working dir for writing during script execution
crshd = crashd_config(workdir="{0}/crashd".format(os.home))
# Read the default kubeconfig, like velero
set_defaults(kube_config(path="{0}/.kube/config".format(os.home)))
capture_local(cmd="velero version")
# These need to go into functions due to Starlark limitations
# if args.backup:
    # backupLogsCmd = "velero backup logs {}".format(args.backup)
    # capture_local(cmd=backupLogsCmd)
# if args.restore:
    # restoreLogsCmd = "velero restore logs {}".format(args.restore)
    # capture_local(cmd=restoreLogsCmd)
kube_capture(what="logs", namespaces=[ns])
kube_capture(what="objects", kinds=["customresourcedefinitions"])
archive(output_file="diagnostics.tar.gz", source_paths=[crshd.workdir])

Some concerns around using crashd:

  • We'll need to figure out how to make sure Velero users can get this tool easily. Use packaging dependencies? Fetch it at runtime?
  • What do we do with offline installs?