Linkerd2: Full automation of control plane TLS cert rotation for clusters

Created on 26 Mar 2020  ·  17 Comments  ·  Source: linkerd/linkerd2

Feature Request

While automating rotation of control plane TLS I noticed that there are three components that cannot be automated:

  • Proxy Injector
  • SP Validator
  • Tap service

The certs for these components can be set manually through the helm chart, but the cli doesn't provide any way to set them.

What problem are you trying to solve?

Provide full automation of all tls certs through the cli and allow automating all tls secrets through cert-manager.

How should the problem be solved?

Just like there's an option to externalize the issuer certificates, the same options should be provided for the three components listed above so they can be also managed by an external component like cert-manager.

What do you want to happen? Add any considered drawbacks.

When these options are provided, the generated manifests will skip creating the opaque tls secrets for the three components above and also avoid injecting the caBundle for their corresponding webhooks and api service so they can be managed externally. cert-manager provides options to inject the ca bundle for webhooks and api services, but I don't know of any other tools that do this.

The main drawback with providing these options is added complexity to the cli and the helm chart. It will also introduce a two-phase approach to installations as is done for an externally managed identity issuer.

Any alternatives you've considered?

The alternative to this option is to manually manage tls for these components through the helm chart. One could follow the instructions described in the manual cert rotation doc, generate certs for these components and pass them to the helm variables.

You're therefore left with an externally managed issuer cert, a manually managed one, or a mixture of the two.

Is there another way to solve this problem that isn't as good a solution?

I tried to naively replace the tls certs for these components, but the deployed application expects files that don't match a tls secret and thus cannot be updated. Updating the secret also leaves the caBundle in the webhook/apiservice unchanged and this breaks the api.

How would users interact with this feature?

If you can, explain how users will be able to use this. Maybe some sample CLI
output?

$ kubectl apply --validate=false -f <cert-manager-manifest>.yml # Install cert manager
$ kubectl create ns linkerd
$ kubectl apply -f <cert-manager-linkerd-certs-manifest>.yml # create linkerd certificates
$ linkerd install --ha --identity-external-issuer \
        --tap-external-issuer \
        --proxy-injector-external-issuer \
        --sp-validator-external-issuer > manifest.yml
$ # Edit manifest.yml and annotate webhooks and apiserver with 

Edit manifest.yml and annotate webhooks and apiservice like so:

...
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
  name: linkerd-proxy-injector-webhook-config
  annotations: # This was added here to automate `caBundle`
    cert-manager.io/inject-ca-from: linkerd/linkerd-proxy-injector-tls
...

Now apply the manifest with the annotation updates

$ kubectl apply -f manifest.yml

This should fully automate all cert-manager components even as far as the trust anchor.
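
As an alternative to hand-editing manifest.yml, the same annotation could probably be added to the live resource with kubectl annotate once the manifest has been applied. The sketch below is only an illustration: the helper name annotate_ca_injection and the KUBECTL override variable are made up for this example, and the resource and certificate names are the ones used above.

```shell
#!/bin/sh
# Sketch: add the cert-manager CA injection annotation to an existing resource.
# KUBECTL is a hypothetical override so the command can be dry-run,
# e.g. KUBECTL=echo prints the arguments instead of calling the cluster.
KUBECTL="${KUBECTL:-kubectl}"

# annotate_ca_injection KIND NAME NAMESPACE/CERTIFICATE
# (helper name is an assumption, not part of linkerd or cert-manager)
annotate_ca_injection() {
    "$KUBECTL" annotate --overwrite "$1" "$2" \
        "cert-manager.io/inject-ca-from=$3"
}
```

For the proxy injector this would be called as `annotate_ca_injection mutatingwebhookconfiguration linkerd-proxy-injector-webhook-config linkerd/linkerd-proxy-injector-tls`. Keeping the annotation in the manifest itself is still preferable where the manifest is the source of truth.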

I've written up some code that attempts to do all this with kustomize, but with the annotations commented out until these features are possibly implemented in some form - https://github.com/misakwa/kustomized-linkerd

wontfix

All 17 comments

Here are some other alternatives to consider:

  1. Try annotating the webhook and apiservice configuration with certmanager.k8s.io/inject-ca-from. Last I tried with Kubebuilder, this will notify cert-manager to inject the specified certificate into the configuration
  2. Last I checked, helm upgrade will recreate the caBundle inside the webhook and apiservice configuration
  3. As of K8s 1.14-ish, if you leave out the caBundle in the configuration, the api-server trust root will be used

The shortcoming with these approaches is that the webhook and tap pods still need to be restarted when the managed caBundle is rotated.

  1. Try annotating the webhook and apiservice configuration with certmanager.k8s.io/inject-ca-from. Last I tried with Kubebuilder, this will notify cert-manager to inject the specified certificate into the configuration

Yeah. That will only partially update the cert references. The deployment backing the webhook/apiservice will still read from the secret. That secret will need to be updated with whatever was injected from cert-manager.

  2. Last I checked, helm upgrade will recreate the caBundle inside the webhook and apiservice configuration

I'll have to test this out. Does linkerd upgrade ... also do the same?

  3. As of K8s 1.14-ish, if you leave out the caBundle in the configuration, the api-server trust root will be used

TIL.

@ihcsim you're right about point 2. helm upgrade does regenerate the certs and restart the pods. I think it's quite useful actually. Unfortunately I wasn't able to get linkerd upgrade ... to do the same.

There's still no way to automatically generate these certs without running a helm upgrade ... though.

@ihcsim so I created a diff that is a WIP of what I'm hoping to achieve here: https://github.com/linkerd/linkerd2/compare/master...misakwa:misakwa/externalize-tls?expand=1

It updates the install and upgrade commands and adds a new option --tls-manager that can be used to control whether the tls resources are created or not.

I've also updated the sample project that I created to fully automate the tls here: https://github.com/misakwa/linkerd-kustomized. With this new option I can easily use cert manager to fully manage all tls resources from the trust anchor down to tls secrets used by the webhooks and api services.

Thoughts?

@misakwa Linkerd 2.7 added the ability to use external PKI providers for managing the certs used for supporting the automatic mTLS feature between services, which is a critical feature, and thus required the extra "configurability".
Note however that the certs for the webhooks and APIServices are used by some of Linkerd's control plane components to authenticate against the k8s apiserver and have nothing to do with Linkerd's identity story. Those certs are automatically generated and last for a year. Besides being able to rotate them before they expire, could you please explain what your use case is and how you would benefit from having an external PKI manage them as well?

Thanks @alpeb!

I've recently deployed linkerd on our infrastructure and one of the main areas that I had to deal with was tls rotation. I realized that keeping those certs updated can be a hassle sometimes. Even though most deployments do not have to worry about it until the 1 year mark, one can easily render the cluster unstable if some of the tls components are allowed to expire.

In my current setup, I've hand rolled a few scripts to make it easier for every member of the team to be able to rotate the certificates, but doing that for the very first time can be daunting if you're not familiar with the process and haven't done it before. I used cert manager for the issuer, but had to script the other components.

My main goal is to be able to run a linkerd deployment with zero work on maintenance except for security fixes or upgrade to add new features that I (or my employer) may need. I would like to not worry about the proxy injector not working after 1 year if I can automate it. In the end I guess I'm looking for something that resembles a fully managed solution - not only the issuer cert is managed, but every single tls cert is externalized.
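
Until everything is externalized, a periodic expiry check is one way to avoid being surprised at the 1 year mark. The following is a minimal sketch, assuming openssl is installed and the certificate has already been saved to a local PEM file; the function name cert_expires_within is made up for this example.

```shell
#!/bin/sh
# cert_expires_within FILE SECONDS
# Succeeds (exit 0) if the PEM certificate in FILE expires within SECONDS.
# Returns 1 if it is still valid after that window, 2 if FILE is unreadable.
# Hypothetical helper; relies only on `openssl x509 -checkend`.
cert_expires_within() {
    [ -r "$1" ] || { echo "cannot read $1" >&2; return 2; }
    # -checkend exits 0 when the cert will NOT expire within the given seconds
    if openssl x509 -checkend "$2" -noout -in "$1" >/dev/null 2>&1; then
        return 1
    fi
    return 0
}
```

A cron job could call `cert_expires_within proxy-injector.crt 2592000` (30 days) and alert on success. How the certificate is extracted from the cluster depends on the secret name and data key used by your Linkerd version, so that step is left out here.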

I've attempted to produce an installation of linkerd with all tls components managed through cert-manager here: https://github.com/misakwa/linkerd-kustomized/tree/0.2-pre.

Thanks again for taking a look at this. I'll do my best to answer any questions you have. Hopefully we can find a way to make this easier for everyone else in the future even if it's not exactly this one.

Hi @alpeb

To add my use case for this issue and PR.

We strictly follow gitops and infrastructure-as-code. We currently have the manifest that's generated by linkerd install ... saved into git. This manifest includes private keys for the webhooks, which isn't ideal and something I would like to move away from.

I've toyed with the linkerd helm chart for 2.7.0 and using cert-manager to generate the identity certificates, however because we're using gitops/infrastructure-as-code, we use ArgoCD to continuously reconcile the Kubernetes state with git. This involves ArgoCD running helm template ... every few minutes.

Due to the way helm works, running helm template ... generates new webhook certs/keys every time unless they are defined in the values.yaml file. However, that brings me back to storing the keys in git, which is what I'm trying to get away from.

Being able to use cert-manager to also manage the webhook certificates would keep our tooling from constantly thinking the Linkerd installation is out of sync with git.

@jon-walton my biggest concern here is introducing a hard dependency on cert-manager. AFAIK it is the only tool that can change the MWC configurations (which will immediately go out of sync with git on you again). Is there a way to have the MWC/APIServices consume secrets without requiring modification of the base assets?

@grampelberg Looking at the PR, if .Values.global.tlsManager is not set to internal, then either the tls secrets are not created, or the cert/key/ca are not added to the secret (identity, proxy injector, webhooks, etc)

We can test this using helm template.

Taken from https://linkerd.io/2/tasks/generate-certificates/

step certificate create identity.linkerd.cluster.local ca.crt ca.key \
--profile root-ca --no-password --insecure

step certificate create identity.linkerd.cluster.local issuer.crt issuer.key \
--ca ca.crt --ca-key ca.key --profile intermediate-ca \
--not-after 8760h --no-password --insecure

cd charts/linkerd2
helm dep up

helm template linkerd2 \
--set-file global.identityTrustAnchorsPEM=ca.crt \
--set-file identity.issuer.tls.crtPEM=issuer.crt \
--set-file identity.issuer.tls.keyPEM=issuer.key \
--set identity.issuer.crtExpiry='2021-04-09T01:05:50Z' . > manifest.yaml

helm template linkerd2 \
--set-file global.identityTrustAnchorsPEM=ca.crt \
--set-file identity.issuer.tls.crtPEM=issuer.crt \
--set-file identity.issuer.tls.keyPEM=issuer.key \
--set identity.issuer.crtExpiry='2021-04-09T01:05:50Z' . > manifest-again.yaml

diff manifest.yaml manifest-again.yaml

We get different certs/keys on every render, which sends automation into a constant reconciliation loop: new certs/keys get deployed and the linkerd pods restart every few minutes.
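
To confirm that the generated cert material is the only thing that changes between two renders, the values can be masked before diffing. This is a rough sketch: the function names are made up, and the field names in the pattern (caBundle, crt.pem, key.pem, crtPEM, keyPEM) are assumptions about what the rendered manifest contains. It only handles single-line values, not multi-line block scalars.

```shell
#!/bin/sh
# mask_cert_material FILE
# Print FILE with the values of cert/key fields replaced by a placeholder.
# The field names are assumptions; adjust them to match your rendered manifest.
mask_cert_material() {
    sed -E 's/^([[:space:]]*(caBundle|crt\.pem|key\.pem|crtPEM|keyPEM):).*/\1 <masked>/' "$1"
}

# manifests_differ_only_in_certs A B
# Succeeds if A and B are identical once cert/key values are masked.
manifests_differ_only_in_certs() {
    a=$(mask_cert_material "$1") && b=$(mask_cert_material "$2") || return 2
    [ "$a" = "$b" ]
}
```

Running `manifests_differ_only_in_certs manifest.yaml manifest-again.yaml` after the two helm template calls above should succeed if nothing but the masked fields changed.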

If we take the PR and do almost the same...

step certificate create identity.linkerd.cluster.local ca.crt ca.key \
--profile root-ca --no-password --insecure

cd charts/linkerd2
helm dep up

helm template linkerd2 \
--set-file global.identityTrustAnchorsPEM=ca.crt \
--set global.tlsManager=external \
--set identity.issuer.scheme=kubernetes.io/tls \
. > manifest.yaml

helm template linkerd2 \
--set-file global.identityTrustAnchorsPEM=ca.crt \
--set global.tlsManager=external \
--set identity.issuer.scheme=kubernetes.io/tls \
. > manifest-again.yaml

diff -s manifest.yaml manifest-again.yaml
Files manifest.yaml and manifest-again.yaml are identical

So with @misakwa's PR, the manifests are identical and stay in sync with git/kubernetes, which means automation (such as ArgoCD) stays happy and doesn't go into a reconciliation loop.

With regards to a hard dependency on cert-manager, I disagree that it's a hard dependency. I think of it as an optional soft dependency. It's up to the cluster operator if they want externally managed certs, Linkerd doesn't care where they come from, it just wants the certs in a secret. If those certs happen to come from cert-manager, that's up to the cluster operator?

As you said, cert-manager (also AFAIK) is the only tool that can update the webhook ca, but there's nothing stopping someone from creating another tool to do that?

However, that's only if the cluster operator wants to automatically generate the trust anchor. As far as I can see, the chart in the PR doesn't support an externally created trust anchor yet, pending a discussion around the proxy picking it up, so that's not a blocker with regards to a dependency on cert-manager for now?

As per our previous discussions, it's not a big deal if the trust anchor public key is in git. Linkerd doesn't need the private key, only cert-manager and we can provide that separately to the Linkerd deployment.

If we did want to entertain the idea of externally provided and rotated trust anchors (out of scope for my use case for now), then we'll just need to update the Linkerd chart to allow us to specify the annotations in a values.yaml, which is a common configuration option for other helm charts. Again, this will also keep the manifest in sync with an infrastructure-as-code repo

(I'm referring to)

apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
  name: linkerd-proxy-injector-webhook-config
  annotations:
    cert-manager.io/inject-ca-from: linkerd/linkerd-proxy-injector-tls # Add this

@jon-walton I see, I was confused. This is all so complicated! Let me walk through the pieces and make sure we're on the same page.

  • MWC resources need a caBundle or the destinations need to use certificates signed by the api-server.
  • Linkerd must create the MWC resources because there's Linkerd specific configuration in there.
  • cert-manager has the ability to update MWC resources and inject the correct caBundle.

This leads to my question: if cert-manager updates the caBundle, won't ArgoCD start updating the resource as it changed from what's in git?

@grampelberg I believe we're on the same page.

If the helm chart doesn't render caBundle into the manifest, ArgoCD won't think it's out of sync. Under the hood, ArgoCD does a diff (I'm not entirely sure how it does that), and applies with helm template ... | kubectl apply. I'm assuming the diff uses the same logic that apply uses. If a manifest has been edited after a kubectl apply to add an item, apply won't delete it again. https://kubernetes.io/docs/tasks/manage-kubernetes-objects/declarative-config/#how-apply-calculates-differences-and-merges-changes

Have a look at https://github.com/jon-walton/linkerd2-gitops-example

It shows the basic steps of standing up a cluster and installing linkerd automatically. Some parts are simplified for the example, hopefully it gives you the gist of what's going on.

The helm chart in that repo is packaged from #4232 with a slight change to allow me to specify annotations on the apiservice and webhooks. You'll want to build the image for that PR and update the helm values (in apps/linkerd.yaml) to point it at the right image.

Once deployed, the first sync might fail (timing with the cert-manager certs being put into a degraded state because they haven't been issued yet), just do a manual sync in the ui.

You'll see that cert-manager successfully injected the caBundle and if deploying a linkerd image from that PR, everything should come up

@jon-walton I like the idea of allowing annotations in the chart. It makes it simple enough for everyone else to use without an additional customization (kustomization.yml) layer. Allowing annotations is also a standard approach in other helm charts. I'll definitely consider this after we iron out the implementation fully.

@jon-walton ahhha! Thank you! I'm a little surprised that works, I've spent quite a lot of time fighting helm's "apply" or complete lack thereof.

I think my main concern can be taken care of by adding a check that errors on the caBundle not being in place so users can have some kind of idea what state they've gotten themselves into and how to get out of it.
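
As a rough shell stand-in for that kind of check (not an existing linkerd check), something like the following could report whether the caBundle has been injected yet. The function names and the KUBECTL override variable are made up for this sketch, and it assumes the webhook of interest is the first entry in the configuration.

```shell
#!/bin/sh
# KUBECTL is a hypothetical override so the check can be exercised with a stub.
KUBECTL="${KUBECTL:-kubectl}"

# require_ca_bundle NAME
# Reads a caBundle value on stdin; fails with a hint when it is empty.
require_ca_bundle() {
    name=$1
    bundle=$(cat)
    if [ -z "$bundle" ]; then
        echo "caBundle missing on $name: has the external certificate been issued and injected?" >&2
        return 1
    fi
    echo "caBundle present on $name"
}

# check_webhook_ca_bundle NAME
# Fetches the caBundle of the first webhook in a MutatingWebhookConfiguration.
check_webhook_ca_bundle() {
    "$KUBECTL" get mutatingwebhookconfiguration "$1" \
        -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | require_ca_bundle "$1"
}
```

For example, `check_webhook_ca_bundle linkerd-proxy-injector-webhook-config` would exit non-zero with a hint while cert-manager has not yet injected the bundle.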

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

Should this issue have been closed?

Yes @chris13524. At least one of the issues has been resolved and the original PR needs to be split apart into easily manageable bits.
