Microk8s: root cert expired after a month, cluster does not respond anymore

Created on 27 Apr 2020 · 21 Comments · Source: ubuntu/microk8s

Running microk8s inspect does not work, and neither does talking to the cluster. The error is: x509: certificate has expired or is not yet valid

How can I renew the root cert?

How can I make it last longer than a month?

bug

All 21 comments

I've seen this issue happen when the date of the machine changed. The certs are valid for at least 365 days.

But then, even after 365 days, there needs to be a way to update the root CA, right?

There is a way to force regeneration of the certs: change csr.conf.template, for example by adding a DNS entry.
I have not personally tried it, but maybe cert-manager can be used here.

Since I am currently unable to use my previous installation of microk8s, I reinstalled it and checked the root CA on the new cluster:

sudo snap install microk8s --classic --channel=1.18/stable
openssl x509 -enddate -noout -in /var/snap/microk8s/current/certs/ca.crt

The result: notAfter=May 27 12:57:33 2020 GMT

Again, it is only valid for one month.

Also, changing csr.conf.template only seems to update server.crt and not ca.crt.
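To confirm how long a certificate is actually valid, its start and end dates can be turned into a day count. The following is a minimal sketch, assuming GNU date (for -d) and a POSIX-style shell with local. The function names validity_days and cert_validity_days are made up for illustration, and the Apr 27 start date in the usage line is an assumed issue date, not one quoted in the thread.

```shell
# Sketch: number of whole days between two dates in the format that
# openssl prints (e.g. "May 27 12:57:33 2020 GMT"). Assumes GNU date.
validity_days() {
  local start end
  start=$(date -u -d "$1" +%s) || return 1
  end=$(date -u -d "$2" +%s) || return 1
  echo $(( (end - start) / 86400 ))
}

# Hypothetical helper: read notBefore/notAfter from a certificate file
# and report its total validity period in days.
cert_validity_days() {
  local not_before not_after
  not_before=$(openssl x509 -startdate -noout -in "$1") || return 1
  not_after=$(openssl x509 -enddate -noout -in "$1") || return 1
  # Strip the "notBefore=" / "notAfter=" prefixes before parsing.
  validity_days "${not_before#*=}" "${not_after#*=}"
}
```

For example, validity_days "Apr 27 12:57:33 2020 GMT" "May 27 12:57:33 2020 GMT" prints 30, and cert_validity_days /var/snap/microk8s/current/certs/ca.crt would report the figure for the cluster's CA.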

You're right. I just checked the code: the CA cert is requested without the -days param, and I think this defaults to 30 days.
@ktsakalozos the ca.crt and the front-proxy-ca.crt are now valid for only 30 days. Is there any particular reason why?
Thanks.
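The effect of a missing -days argument can be reproduced outside of MicroK8s. The sketch below creates a throwaway self-signed CA in a temporary directory, once without -days and once with -days 3650, and prints both expiry dates. The subject name, key size, 3650-day value, and the make_ca helper are assumptions chosen for illustration; this is not the exact command from the MicroK8s install hook.

```shell
# Sketch: show how -days changes the lifetime of a self-signed CA cert.
# Everything is written to a temporary directory; nothing under
# /var/snap/microk8s is touched.
make_ca() {
  # $1: output directory; remaining args: extra flags for "openssl req"
  ca_dir=$1
  shift
  mkdir -p "$ca_dir"
  openssl genrsa -out "$ca_dir/ca.key" 2048 2>/dev/null
  openssl req -x509 -new -sha256 -key "$ca_dir/ca.key" \
    -subj "/CN=example-ca" -out "$ca_dir/ca.crt" "$@"
}

workdir=$(mktemp -d)
make_ca "$workdir/default"           # no -days: openssl falls back to its default
make_ca "$workdir/long" -days 3650   # explicit lifetime of roughly ten years

# Print the expiry date of each certificate for comparison.
openssl x509 -enddate -noout -in "$workdir/default/ca.crt"
openssl x509 -enddate -noout -in "$workdir/long/ca.crt"
```

The first notAfter line should land about a month from now and the second about ten years out, which matches the behaviour reported in this thread.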

Same happened to my system yesterday. Is there no way to regenerate both ca.crt and server.crt so I can get back in action?

I have not tried this. 😊
Maybe we can try this:

  • Stop microk8s: microk8s.stop
  • Delete the .crt and .key files in /var/snap/microk8s/current/certs.
  • Modify the file csr.conf.template by adding a DNS entry or a comment.
  • Start microk8s: microk8s.start

You may want to back up the content of /var/snap/microk8s/current/certs first.
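The backup suggested for these steps could be sketched as follows. The function name backup_certs and the timestamped destination naming are assumptions for illustration; the default source path is the certs directory named in the steps.

```shell
# Sketch: copy a certs directory to a timestamped sibling directory
# before deleting or regenerating anything in it.
backup_certs() {
  # $1: directory to back up (defaults to the MicroK8s certs directory)
  src=${1:-/var/snap/microk8s/current/certs}
  dest="${src%/}.bak-$(date +%Y%m%d-%H%M%S)"
  [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
  # -a preserves ownership, permissions, and timestamps of the key files.
  cp -a "$src" "$dest" && echo "$dest"
}
```

Called without arguments from a root shell, it prints the path of the new backup directory. The files under /var/snap are typically root-owned, so an unprivileged call is likely to fail.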

This is an oversight on our part (the missing -days argument).

In the past few hours we have patched the affected tracks, tested, and released a new snap. This way any new deployments should not have this issue.

We will continue our work on a fix for the already existing deployments. The approach @balchua suggests seems promising but we will also need to recreate the kubeconfigs below line https://github.com/ubuntu/microk8s/blob/master/snap/hooks/install#L18

Thank you @ThomasSchoenbeck @EzraBrooks for reporting this and @balchua for spotting the issue and offering your help. Apologies for the inconvenience we may have caused.

@ktsakalozos I may be wrong here, but I think the kubeconfigs use tokens rather than certs.

I tried what balchua suggested (adding a DNS entry) but couldn't get it working. microk8s does not start.

Bummer. It's probably not regenerating the certs.

The script I have for now is here: https://gist.github.com/ktsakalozos/5de8d4c86c976eeef0242cc39fdf82b2

It would be great if anyone would run it and provide feedback.

curl https://gist.githubusercontent.com/ktsakalozos/5de8d4c86c976eeef0242cc39fdf82b2/raw/f29ff555346435154553d35ff64a8282f867011f/refresh-certs.sh -o refresh.sh
chmod +x refresh.sh
sudo ./refresh.sh

After running the script the pods in the cluster should go into an unknown state and restart after some seconds.

The intention is to place the above script in a microk8s.refresh-certs command to address this issue in affected deployments.

@balchua the kubeconfig files use tokens, but they also carry the CA cert, which is why I think they need to be recreated.

@ktsakalozos ah yes, the CADATA needs to be repopulated. I totally overlooked it.

Sorry for the late response. I tested the script above. The new certs are valid until the year 2030.

All pods went into an unknown state. About 30 seconds later, all pods were up again.

Thanks for the help!

Now that there is a way to refresh the certs, are we good to close this one?
Thanks

Yes, I am closing the ticket! Thanks again for the great help, @balchua and @ktsakalozos.

Also noting that the script @ktsakalozos provided fixes the issue for me. Thank you!

Was facing the same issue. The refresh.sh script worked for me. Afterwards I was facing DNS resolution errors. All services would crash with errors similar to

socket.gaierror: [Errno -3] Temporary failure in name resolution

To save others from 2 hours of debugging: Make sure that coredns has 1/1 ready in kubectl -n kube-system get all. Its readiness probe had failed and logs showed

E0524 09:50:35.607082       1 reflector.go:125] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:98: Failed to list *v1.Endpoints: Get https://10.152.183.1:443/api/v1/endpoints?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid

Deleting the pod (forcing it to restart) solved the issue for me.

Got the idea to check kube-system from here and https://github.com/ubuntu/microk8s/issues/332#issue-413517185.

I hit upon the same issue just now, had to run refresh.sh and also had to give the coredns pod a kick, thank you @PeterSR for sharing that.

Everything seems to be back to working order, however I cannot pull an image from a private repo now.

  Normal   Scheduled  17m                  default-scheduler  Successfully assigned homelab/newimage-66c8d88f65-lhvdz to kube
  Normal   Pulling    15m (x4 over 17m)    kubelet, kube      Pulling image "registry.gitlab.com/realg/kube/newimage:20.05"
  Warning  Failed     15m (x4 over 17m)    kubelet, kube      Failed to pull image "registry.gitlab.com/realg/kube/newimage:20.05": rpc error: code = Unknown desc = failed to resolve image "registry.gitlab.com/realg/kube/newimage:20.05": no available registry endpoint: failed to fetch anonymous token: unexpected status: 403 Forbidden
  Warning  Failed     15m (x4 over 17m)    kubelet, kube      Error: ErrImagePull
  Normal   BackOff    11m (x21 over 17m)   kubelet, kube      Back-off pulling image "registry.gitlab.com/realg/kube/newimage:20.05"
  Warning  Failed     113s (x65 over 17m)  kubelet, kube      Error: ImagePullBackOff

The image is definitely there: I can pull it with docker from another host using the same dockerconfig.json. I haven't made any other changes to my cluster, which has me thinking that it's related to refreshing the expired certs.

Has anyone had the same issue?

@realG

I actually had problems pulling images as well for a specific deployment, but then I realized that I had forgotten

imagePullSecrets:
- name: regcred

for that specific one and it was therefore unrelated to this issue. But I did recreate the regcred secret from scratch in an attempt to fix it, so who knows.

@PeterSR thank you for the tip yet again! I was indeed missing imagePullSecrets in the deployment yaml.
