Linkerd2: control plane installation debugging (pre/post flight check)

Created on 7 Jun 2018 · 22 comments · Source: linkerd/linkerd2

When installing the control plane, there is a collection of things that can cause problems. Many of these can be validated beforehand.

Once the control plane has been installed, there is a collection of things that can go wrong and make it difficult to operate linkerd2. Many of these can be validated after installation.

check is an awesome tool. It would be nice to extend this to have filters for the checks (pre and post install checks). Then, a new user could run check to make sure their cluster is ready to go (or fix any problems if it isn't).

CLI

Some example options and output. Note: this isn't meant to be all the checks, see further down for a list.

linkerd check --pre (success) - Limit the number of checks to only look at the ones relevant before installation.

==== Preflight checks ====
Cluster Config                              ✓
Cluster Connection                          ✓
Cluster Version                             ✓
Cluster Permissions                         ✓

linkerd check --pre (failure) - On failure, provide links to documentation that explain how to fix the problem.

==== Preflight checks ====
Cluster Config                              ✓
Cluster Connection                          ✓
Cluster Version                             ✓
Cluster Permissions                         X
    -- It appears that the currently active configuration does not have the correct RBAC to install linkerd. Take a look at https://linkerd.io/2/rbac/ for more details.
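The failure output above pairs each failed check with a hint that links to documentation. A minimal sketch of that rendering, assuming hypothetical names (`CheckResult`, `render`) and an arbitrary column width; this is an illustration of the proposed output format, not the actual linkerd implementation:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CheckResult:
    """Outcome of one check (hypothetical type, for illustration only)."""
    name: str
    ok: bool
    hint: Optional[str] = None  # remediation text, e.g. a docs link


def render(title: str, results: List[CheckResult], width: int = 44) -> str:
    """Format results as a titled block, one line per check.

    Failed checks that carry a hint get an indented follow-up line.
    """
    lines = [f"==== {title} ===="]
    for r in results:
        status = "✓" if r.ok else "X"
        # Left-align the name in a fixed-width column, then the status mark.
        lines.append(f"{r.name:<{width}}{status}")
        if not r.ok and r.hint:
            lines.append(f"    -- {r.hint}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render("Preflight checks", [
        CheckResult("Cluster Config", True),
        CheckResult("Cluster Permissions", False,
                    "Take a look at https://linkerd.io/2/rbac/ for more details."),
    ]))
```

Keeping the hint on the result object means each check owns its own remediation text, so the renderer does not need to know anything about individual checks.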

linkerd check --post --wait - Limit the number of checks to only look at ones relevant after installation and wait until they all pass (with a timeout).

==== Postflight checks ====
Cluster Service                             ✓
Service Health                              ✓
Service Version                             ✓
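The `--wait` behavior described above (re-run the checks until they all pass, or give up after a timeout) could look roughly like the following. This is a sketch under assumptions: `wait_for_checks`, its default timeout and polling interval, and the injectable `clock`/`sleep` parameters are all hypothetical and chosen only to make the loop easy to follow and test.

```python
import time
from typing import Callable, Dict


def wait_for_checks(
    checks: Dict[str, Callable[[], bool]],
    timeout: float = 300.0,   # assumed default, in seconds
    interval: float = 5.0,    # assumed polling interval, in seconds
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, bool]:
    """Re-run every check until all pass or the timeout elapses.

    Returns the results of the last round, so the caller can report
    which checks were still failing when the deadline was reached.
    """
    deadline = clock() + timeout
    while True:
        results = {name: check() for name, check in checks.items()}
        if all(results.values()) or clock() >= deadline:
            return results
        sleep(interval)
```

The checks always run at least once, even with a zero timeout, so the caller gets a result for every check either way.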

Checks

Preflight

can init k8s client
can query API
minimum version
has permissions
isn't already installed
has resource requirements
host kernel version support
cluster networking configuration
can pull images

Postflight

components are ready
can query api
up to date

Labels: area/cli, area/docs, area/usability, priority/P0, roadmap, stage/proposal

Source

grampelberg

Most helpful comment

though I still like the one shot install

I think I'm against this idea. Install is intentionally zero-magic. Typing conduit can't feel risky. Having to type kubectl is an explicit acceptance of responsibility for whatever happens next. At the very least, I would not want the install command to take any action by default.

olix0r on 7 Jun 2018

👍2

All 22 comments

Instead of piping install into kubectl

I like the idea of automatically running post-install checks. However, I also think there's value in having install just output yaml rather than actually changing the cluster. The current approach allows the install output to be manually inspected (and changed) before it's applied, or saved to a file so it can be applied multiple times (which I've found useful for testing). I think these advantages of the current approach are worth maintaining.

Perhaps there should be two commands, one that outputs yaml as we do currently, and one that does the entire install process plus checks?

hawkw on 7 Jun 2018

@hawkw 100% agreed. Personally, I'd go for a flag on install (conduit install -o yaml). Either that or a separate command is a requirement.

grampelberg on 7 Jun 2018

Instead of piping install into kubectl

Myself, I would never let conduit install modify my (production) cluster itself; I would always go through the process of running conduit install > conduit-install.yml, manually reviewing conduit-install.yml, and then running kubectl apply -f conduit-install.yml. One advantage of conduit install | kubectl apply -f - is that it hints, succinctly, that such a careful installation process is possible.

In particular, we originally decided on the current approach because we wanted people to be able to see how we are changing the cluster by letting them inspect the yaml, because we thought that people wouldn't just trust conduit install to do something reasonable. Also at the time we thought people might version control the output of conduit install in order to keep a history of changes to the configurations.

We can change the approach so that the piping into kubectl is avoided, and provide an "output the yaml" option to let people inspect the configuration. However, this raises the question "Does the yaml completely describe everything that is done?", i.e. it creates some confusion about what happens during install. Also, currently the conduit tool is always safe in the sense that it is read-only, i.e. it never changes anything (IIRC). Finally, you can save the conduit install output once and replay it to multiple clusters (e.g. in minikube on my laptop, and then in GKE after testing it) and you know that it is the same configuration. With conduit install doing the kubectl apply itself, it wouldn't be clear that it would install the same configuration each time it is given the same inputs.

Note that all of this applies to conduit inject too.

Anyway, don't interpret these comments as being -1 (or +1) to such changes. I just want to provide context regarding the original (current) design.

briansmith on 7 Jun 2018

Regarding the overall idea, I think it's important that we have a very good pre-install check mechanism (that can be done separately from the install), and a very good post-install check mechanism (that can be run separately from the install). Increasing the thoroughness of those checks is the most important improvement we can make, IMO. Finding problems (e.g. RBAC isn't enabled) and automatically generating a script that can fix those problems is the second most important thing we can do.

Having a "one step" install that does everything is harder for me to see the value in, because I don't think a one-step install is what one would do in a production deployment, realistically. A one-step install does look good in a demo because it makes conduit look easy to install, but it seems like pure demo-ware. We'd have to document the careful approach in addition to the one-step approach so it won't actually simplify the documentation to have a one-step option (default or otherwise).

briansmith on 7 Jun 2018

That's a great point, I had conflated everything a little bit too much (though I still like the one shot install). Maybe everything could be scoped to the check command?

conduit check --pre
conduit check --post

grampelberg on 7 Jun 2018

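though I still like the one shot install

I think I'm against this idea. Install is intentionally zero-magic. Typing conduit can't feel risky. Having to type kubectl is an explicit acceptance of responsibility for whatever happens next. At the very least, I would not want the install command to take any action by default.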

olix0r on 7 Jun 2018

👍2

`conduit check --pre`

can init k8s client
can query API
minimum version
has permissions
isn't already installed
has resource requirements
host kernel version support
cluster networking configuration
can pull images

`conduit check --post --wait`

components are ready
can query api
up to date

grampelberg on 8 Jun 2018

LGTM. Maybe --post should be the default? I slightly prefer conduit pre-install and conduit check but if you don't love those names then no use bike-shedding.

briansmith on 8 Jun 2018

I like conduit check running through everything by default, with --pre and --post just being filters (maybe conduit check --groups=pre,post?).

Having a separate command for running the pre-flight checks makes sense, especially so that first-time users can find the functionality without docs; it just feels like a little duplication, maybe?

grampelberg on 8 Jun 2018

Good point about conduit check checking both pre and post, though perhaps the post-check should always check the pre- stuff anyway? I.e. "post" should always be a superset of "pre"?

briansmith on 8 Jun 2018

I dunno why not. I do like the idea of having groups in the future, something like:

# conduit check
==== Pre-Install ====

==== Post-Install ====

==== Ready for Upgrade ====

==== Upgrade Success ====

(Feels like we'd want some kind of pre/post-flight and upgrade automation as well)

grampelberg on 8 Jun 2018

Potential additions to conduit check --pre:

Host kernel version isn't among those affected by #982
Some kind of validation of the cluster's networking config?

hawkw on 9 Jun 2018

#139 is something worth working on at the same time.

grampelberg on 12 Jun 2018

A potential check that might be helpful:

https://github.com/kubernetes/node-problem-detector

grampelberg on 19 Jul 2018

See #1421 for another thing to check (NetworkPolicy).

grampelberg on 10 Aug 2018

Sorry for chiming in late here. I really like the way that the spec for these checks has evolved. I just wanted to add:

Maybe we don't need to support a --post flag? We can certainly divide the checks up into pre-install and post-install (and only run the pre-install checks when the --pre flag is set), but I don't think it will ever make sense to run the post-install checks in isolation. The post-install checks will fail in weird ways if the pre-install checks fail, so we should validate that the pre-install checks pass as part of running the post-install checks.

In other words:

linkerd check --pre
# runs only pre-install checks
# intended to be run before linkerd is installed

linkerd check
# runs both pre- and post-install checks
# intended to be run after linkerd is installed

That seems easier to grok from a user's perspective, too.

klingerf on 16 Aug 2018

👍1

Note that there are also a few pre-flight checks described above that we should only run when the --pre flag is present, and we should skip when running pre- and post-flight checks together. Namely:

has permissions to install
isn't already installed
has resource requirements
can pull images

klingerf on 16 Aug 2018
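The selection rule proposed in the two comments above (`--pre` runs only pre-install checks; the default runs pre- and post-install checks together, minus the pre-install checks that only make sense before installation) can be sketched as a simple filter. The `Check` type, the `pre_only` field, and `select_checks` are hypothetical names for illustration, and the check names are taken from the lists in this thread; none of this is the actual linkerd code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Check:
    name: str
    stage: str              # "pre" or "post"
    pre_only: bool = False  # run only when --pre is given


CHECKS = [
    Check("can init k8s client", "pre"),
    Check("can query API", "pre"),
    Check("minimum version", "pre"),
    Check("has permissions to install", "pre", pre_only=True),
    Check("isn't already installed", "pre", pre_only=True),
    Check("has resource requirements", "pre", pre_only=True),
    Check("can pull images", "pre", pre_only=True),
    Check("components are ready", "post"),
    Check("up to date", "post"),
]


def select_checks(checks: List[Check], pre_flag: bool) -> List[Check]:
    """Pick the checks to run for `check --pre` versus plain `check`."""
    if pre_flag:
        # Before installation: every pre-install check, nothing else.
        return [c for c in checks if c.stage == "pre"]
    # After installation: shared pre-install checks plus post-install
    # checks, skipping the ones that would fail on an installed cluster.
    return [c for c in checks if not c.pre_only]


if __name__ == "__main__":
    for c in select_checks(CHECKS, pre_flag=False):
        print(c.name)
```

Tagging checks with data rather than hard-coding two lists leaves room for the additional groups (upgrade readiness, and so on) floated earlier in the thread.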

Maybe we don't need to support a --post flag?

Makes sense to me. Some part of me would like to have multiple sections of checks that you can run individually. I don't have a good use case for that though, so it is likely over-engineering at this point.

grampelberg on 16 Aug 2018

Ok, sounds good. And as part of #1417 we should consider at least grouping the check output into multiple sections. Down the road we could add support for running individual sections if that makes sense.

klingerf on 16 Aug 2018

I think there may be another scenario to think about; correct me if we do not need to worry about it. Should we also consider an optional flag for linkerd install that also runs these pre- and post-flight checks? A possible use case for this feature would be: run linkerd check --pre, then linkerd install, then linkerd check. It feels like there is some value in having this baked into the install process as an option.

dadjeibaah on 16 Aug 2018

@dadjeibaah the original suggestion was to have linkerd install just take care of it all for you (and wait until everything was healthy). The more I think about it, having install do nothing more than output some YAML is the best behavior we can have.

Since most folks will do an install by following the getting started documentation, we can have them use the pre/post checks as part of the process and still have the ability for folks to see exactly what is happening to their cluster and potentially change it for their specific use cases.

grampelberg on 16 Aug 2018

I'm going to close this, since pre- and post-install checks are now available via the linkerd check command, and fleshing out the set of checks that we run pre- and post-install is ticketed separately (e.g. #1474, #1475, #1732, #1741).

klingerf on 19 Dec 2018
