Microk8s: root cert expired after a month, cluster does not respond anymore

Created on 27 Apr 2020 · 21 Comments · Source: ubuntu/microk8s

Running microk8s inspect does not work, and neither does talking to the cluster. The error is: x509: certificate has expired or is not yet valid

How can I renew the root cert?

How can I make it last longer than a month?

bug

ThomasSchoenbeck

👍5

All 21 comments

I've seen this issue happen when the date of the machine changed. The certs' duration is at least 365 days.

balchua on 27 Apr 2020

But then, even after 365 days, there needs to be a way to update the root CA, right?

ThomasSchoenbeck on 27 Apr 2020

There is a way to force regeneration of the certs: change the csr.conf.template, say by adding a DNS entry.
I have not personally tried it, but maybe cert-manager can be used here.

balchua on 27 Apr 2020

Since I am currently not able to use my previous installation of microk8s, I did a reinstall using sudo snap install microk8s --classic --channel=1.18/stable and checked the root CA on the new cluster using openssl x509 -enddate -noout -in /var/snap/microk8s/current/certs/ca.crt

The result: notAfter=May 27 12:57:33 2020 GMT

Again, it is only valid for one month.

Also, changing the csr.conf.template only seems to update server.crt and not ca.crt.

ThomasSchoenbeck on 27 Apr 2020
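
Editorial sketch: the openssl check above can be repeated for every certificate named in this thread. The function name check_cert and the CERT_DIR override are illustrative assumptions, not part of microk8s; the directory and file names are the ones quoted in the comments.

```shell
#!/usr/bin/env bash
# Sketch: report the expiry of each MicroK8s certificate mentioned in this thread.
# CERT_DIR defaults to the path quoted above; override it if yours differs.
CERT_DIR="${CERT_DIR:-/var/snap/microk8s/current/certs}"

check_cert() {
  # $1 = path to a PEM certificate
  local crt="$1"
  if [ ! -f "$crt" ]; then
    echo "$crt: not found"
    return 2
  fi
  # -enddate prints the notAfter field.
  local end
  end="$(openssl x509 -enddate -noout -in "$crt")" || return 2
  # -checkend 0 exits non-zero if the certificate has already expired.
  if openssl x509 -checkend 0 -noout -in "$crt" >/dev/null; then
    echo "$crt: valid, $end"
  else
    echo "$crt: EXPIRED, $end"
    return 1
  fi
}

for name in ca.crt server.crt front-proxy-ca.crt; do
  check_cert "$CERT_DIR/$name" || true
done
```

Reading the files under /var/snap may require sudo.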

You're right. I just checked the code: the -days param is not specified when the CA cert is requested, so I think it defaults to 30 days.
@ktsakalozos the ca.crt and the front-proxy-ca.crt are now valid for only 30 days; any particular reason why?
Thanks.

balchua on 27 Apr 2020
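
Editorial sketch of the effect described above: openssl req -x509 issues a certificate valid for 30 days unless -days is given. The example creates a throwaway CA in a temporary directory and does not touch a microk8s installation. The subject name example-ca, the file names, and the 3650-day lifetime are assumptions chosen for illustration, not the values microk8s uses.

```shell
#!/usr/bin/env bash
# Sketch: compare a self-signed CA created without -days to one with an explicit lifetime.
workdir="$(mktemp -d)"
openssl genrsa -out "$workdir/ca.key" 2048 2>/dev/null

# No -days: openssl falls back to its 30-day default.
openssl req -x509 -new -key "$workdir/ca.key" -subj "/CN=example-ca" \
  -out "$workdir/ca-default.crt"

# Explicit lifetime of roughly ten years.
openssl req -x509 -new -key "$workdir/ca.key" -subj "/CN=example-ca" \
  -days 3650 -out "$workdir/ca-long.crt"

# Print the notAfter date of each certificate for comparison.
openssl x509 -enddate -noout -in "$workdir/ca-default.crt"
openssl x509 -enddate -noout -in "$workdir/ca-long.crt"
```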

Same happened to my system yesterday. Is there no way to regenerate both ca.crt and server.crt so I can get back in action?

EzraBrooks on 27 Apr 2020

I have not tried this. 😊
Maybe we can try this:

1. Stop microk8s: microk8s.stop
2. Delete the .crt and .key files in /var/snap/microk8s/current/certs.
3. Modify the file csr.conf.template by adding a DNS entry or a comment.
4. Start microk8s: microk8s.start

You may want to back up the contents of /var/snap/microk8s/current/certs first.

balchua on 27 Apr 2020
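
Editorial sketch of the backup suggested above, to be run before deleting anything. The function name backup_certs and the certs-backup-<timestamp> naming are assumptions for illustration only.

```shell
#!/usr/bin/env bash
# Sketch: copy a certs directory to a timestamped backup and print the backup path.
backup_certs() {
  # $1 = directory to back up, $2 = parent directory that will hold the backup
  local src="$1" dest_parent="$2"
  if [ ! -d "$src" ]; then
    echo "no such directory: $src" >&2
    return 1
  fi
  local dest
  dest="$dest_parent/certs-backup-$(date +%Y%m%d-%H%M%S)"
  # cp -a preserves ownership, permissions, and timestamps of keys and certs.
  mkdir -p "$dest_parent" && cp -a "$src" "$dest" && echo "$dest"
}
```

From a root shell on the node, this could be called as backup_certs /var/snap/microk8s/current/certs /root.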

This is an oversight on our part (the missing -days argument).

In the past hours we have patched the affected tracks, tested and released a new snap. This way any new deployments should not have this issue.

We will continue our work on a fix for the already existing deployments. The approach @balchua suggests seems promising but we will also need to recreate the kubeconfigs below line https://github.com/ubuntu/microk8s/blob/master/snap/hooks/install#L18

Thank you @ThomasSchoenbeck @EzraBrooks for reporting this and @balchua for spotting the issue and offering your help. Apologies for the inconvenience we may have caused.

ktsakalozos on 28 Apr 2020

@ktsakalozos I may be wrong here, but I think the kubeconfigs are using tokens rather than certs.

balchua on 28 Apr 2020

I tried what balchua suggested (adding a DNS entry) but couldn't get it working. microk8s does not start.

andyshinn on 28 Apr 2020

Bummer. It's probably not regenerating the certs.

balchua on 28 Apr 2020

The script I have for now is here: https://gist.github.com/ktsakalozos/5de8d4c86c976eeef0242cc39fdf82b2

It would be great if anyone would run it and provide feedback.

curl https://gist.githubusercontent.com/ktsakalozos/5de8d4c86c976eeef0242cc39fdf82b2/raw/f29ff555346435154553d35ff64a8282f867011f/refresh-certs.sh -o refresh.sh
chmod +x refresh.sh
sudo ./refresh.sh

After running the script the pods in the cluster should go into an unknown state and restart after some seconds.

The intention is to place the above script in a microk8s.refresh-certs command to address this issue in affected deployments.

@balchua the kubeconfig files use tokens, but they also carry the CA cert; that is why I think they need to be recreated.

ktsakalozos on 28 Apr 2020

👍19 ❤11

@ktsakalozos aaa yes the CADATA needs to be repopulated. Totally overlooked it.

balchua on 28 Apr 2020

Sorry for the late response. I tested the script above. The new certs are valid until the year 2030.

All pods went into an unknown state. About 30 seconds later, all pods came back up.

Thanks for the help!

ThomasSchoenbeck on 4 May 2020

👍1

Now that there is a way to refresh the tokens, are we good to close this one?
Thanks

balchua on 4 May 2020

Yes. I am closing the ticket! Thanks again for the great help @balchua and @ktsakalozos

ThomasSchoenbeck on 4 May 2020

Also noting that the script @ktsakalozos provided fixes the issue for me. Thank you!

bmreading on 7 May 2020

👍1

Was facing the same issue. The refresh.sh script worked for me. Afterwards I was facing DNS resolution errors. All services would crash with errors similar to

socket.gaierror: [Errno -3] Temporary failure in name resolution

To save others from 2 hours of debugging: Make sure that coredns has 1/1 ready in kubectl -n kube-system get all. Its readiness probe had failed and logs showed

E0524 09:50:35.607082       1 reflector.go:125] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:98: Failed to list *v1.Endpoints: Get https://10.152.183.1:443/api/v1/endpoints?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid

Deleting the pod (forcing it to restart) solved the issue for me.

Got the idea to check kube-system from here and https://github.com/ubuntu/microk8s/issues/332#issue-413517185.

PeterSR on 24 May 2020

👍1

I hit upon the same issue just now, had to run refresh.sh and also had to give the coredns pod a kick, thank you @PeterSR for sharing that.

Everything seems to be back in working order; however, I cannot pull an image from a private repo now.

  Normal   Scheduled  17m                  default-scheduler  Successfully assigned homelab/newimage-66c8d88f65-lhvdz to kube
  Normal   Pulling    15m (x4 over 17m)    kubelet, kube      Pulling image "registry.gitlab.com/realg/kube/newimage:20.05"
  Warning  Failed     15m (x4 over 17m)    kubelet, kube      Failed to pull image "registry.gitlab.com/realg/kube/newimage:20.05": rpc error: code = Unknown desc = failed to resolve image "registry.gitlab.com/realg/kube/newimage:20.05": no available registry endpoint: failed to fetch anonymous token: unexpected status: 403 Forbidden
  Warning  Failed     15m (x4 over 17m)    kubelet, kube      Error: ErrImagePull
  Normal   BackOff    11m (x21 over 17m)   kubelet, kube      Back-off pulling image "registry.gitlab.com/realg/kube/newimage:20.05"
  Warning  Failed     113s (x65 over 17m)  kubelet, kube      Error: ImagePullBackOff

The image is definitely there: I can pull it with docker from another host using the same dockerconfig.json. I haven't made any other changes to my cluster, so that has me thinking it's related to refreshing the expired certs.

Has anyone had the same issue?

realG on 26 May 2020

👍2

@realG

I actually had problems pulling images as well for a specific deployment, but then I realized that I had forgotten

imagePullSecrets:
- name: regcred

for that specific one and it was therefore unrelated to this issue. But I did recreate the regcred secret from scratch in an attempt to fix it, so who knows.

PeterSR on 26 May 2020

@PeterSR thank you for the tip yet again! I was indeed missing imagePullSecrets in the deployment yaml.

realG on 26 May 2020

🎉1
