Nixpkgs: etcd.pem not initialized with services.kubernetes.roles master

Created on 12 Apr 2019 · 26 comments · Source: NixOS/nixpkgs

Issue description

When services.kubernetes.roles = ["master"] is enabled, I get this error when starting the etcd service:

avril 12 18:10:01 xps15.px.io etcd[29989]: open /var/lib/kubernetes/secrets/etcd.pem: no such file or directory

Steps to reproduce

Use services.kubernetes.roles = ["master"] in /etc/nixos/configuration.nix

Technical details

  • system: "x86_64-linux"
  • host os: Linux 4.19.34, NixOS, 19.03.172138.5c52b25283a (Koi)
  • multi-user?: yes
  • sandbox: yes
  • version: nix-env (Nix) 2.2
  • channels(root): "nixos-19.03.172138.5c52b25283a"
  • nixpkgs: /nix/var/nix/profiles/per-user/root/channels/nixos

All 26 comments

FWIW it's not happening on the 18.09 release

cc @johanot @arianvp

@zarelit that it's not happening on 18.09 is expected. We moved to mandatory pki in 19.03 (https://nixos.org/nixos/manual/index.html#sec-kubernetes)

However, this config _should_ work because "master" should imply easyCerts = true and bootstrap the certificates automatically.
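
For reference, a minimal sketch of such a config with easyCerts spelled out explicitly (the hostname is a placeholder of mine, not taken from the report):

  # Sketch only; "kube-master.example" is a placeholder hostname.
  # roles = ["master"] alone is expected to imply easyCerts = true.
  services.kubernetes = {
    roles = ["master"];
    masterAddress = "kube-master.example";
    easyCerts = true;
  };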

@apeyroux was the error transient or fatal? Did etcd eventually start up successfully with the certs, or did it just fail to start up at all?

We recently merged a PR (yesterday) that changes the order in which components are started, to reduce the number of transient non-fatal errors: https://github.com/NixOS/nixpkgs/pull/56789

There might be some time where the certs are still being generated, but etcd is already started. However, after the certs appear, etcd should start just fine

It can take up to several minutes for the kubernetes cluster to stabilise. Did it eventually complete bootstrapping?

@arianvp I have the same issue, even though I don't know if the cause is the same. It looks like it's not transient

relevant parts of my configuration, trying to set up k8s on my laptop:

  services.kubernetes = {
    roles = ["master" "node"];
    addons.dashboard.enable = true;
    kubelet.extraOpts = "--fail-swap-on=false";
    masterAddress = "localhost";
  };

Certmgr unit is stuck in this loop

Apr 18 13:37:06 tsundoku systemd[1]: Starting certmgr...
Apr 18 13:37:06 tsundoku qr84y7ksrgydiska07vjd6q9vbymlkbj-unit-script-certmgr-pre-start[11269]: 2019/04/18 13:37:06 [INFO] certmgr: loading from config file /nix/store/q6s4lclkzmr1g59dqnjz6kdi6azqy8fj-certmgr.yaml
Apr 18 13:37:06 tsundoku qr84y7ksrgydiska07vjd6q9vbymlkbj-unit-script-certmgr-pre-start[11269]: 2019/04/18 13:37:06 [INFO] manager: loading certificates from /nix/store/ms3w33cai719d8971hsdmi4j21fs25pq-certmgr.d
Apr 18 13:37:06 tsundoku qr84y7ksrgydiska07vjd6q9vbymlkbj-unit-script-certmgr-pre-start[11269]: 2019/04/18 13:37:06 [INFO] manager: loading spec from /nix/store/ms3w33cai719d8971hsdmi4j21fs25pq-certmgr.d/addonManager.json
Apr 18 13:37:06 tsundoku qr84y7ksrgydiska07vjd6q9vbymlkbj-unit-script-certmgr-pre-start[11269]: 2019/04/18 13:37:06 [ERROR] cert: failed to fetch remote CA: {"code":7400,"message":"failed POST to https://localhost:8888/api/v1/cfssl/info: Post https://localhost:8888/api/v1/cfssl/info: x509: certificate is valid for tsundoku.lan, not localhost"}
Apr 18 13:37:06 tsundoku qr84y7ksrgydiska07vjd6q9vbymlkbj-unit-script-certmgr-pre-start[11269]: Failed: {"code":7400,"message":"failed POST to https://localhost:8888/api/v1/cfssl/info: Post https://localhost:8888/api/v1/cfssl/info: x509: certificate is valid for tsundoku.lan, not localhost"}
Apr 18 13:37:06 tsundoku systemd[1]: certmgr.service: Control process exited, code=exited status=1
Apr 18 13:37:06 tsundoku systemd[1]: certmgr.service: Failed with result 'exit-code'.
Apr 18 13:37:06 tsundoku systemd[1]: Failed to start certmgr.

Without any clue I tried to change the kubernetes.masterAddress from localhost to the hostname tsundoku.lan and now it complains like this:

Apr 18 13:43:05 tsundoku systemd[1]: Starting certmgr...
Apr 18 13:43:05 tsundoku 796ywa0556grby68vb4p73mn5yn3l74x-unit-script-certmgr-pre-start[22733]: 2019/04/18 13:43:05 [INFO] certmgr: loading from config file /nix/store/xy17dlkik1rcyvdxb6n6xa5fqq7hgdxk-certmgr.yaml
Apr 18 13:43:05 tsundoku 796ywa0556grby68vb4p73mn5yn3l74x-unit-script-certmgr-pre-start[22733]: 2019/04/18 13:43:05 [INFO] manager: loading certificates from /nix/store/d4njrd8r64mqgq4h6dxmbg6iysha5wgn-certmgr.d
Apr 18 13:43:05 tsundoku 796ywa0556grby68vb4p73mn5yn3l74x-unit-script-certmgr-pre-start[22733]: 2019/04/18 13:43:05 [INFO] manager: loading spec from /nix/store/d4njrd8r64mqgq4h6dxmbg6iysha5wgn-certmgr.d/addonManager.json
Apr 18 13:43:15 tsundoku 796ywa0556grby68vb4p73mn5yn3l74x-unit-script-certmgr-pre-start[22733]: 2019/04/18 13:43:15 [ERROR] cert: failed to fetch remote CA: {"code":7400,"message":"failed POST to https://tsundoku.lan:8888/api/v1/cfssl/info: Post https://tsundoku.lan:8888/api/v1/cfssl/info: dial tcp: lookup tsundoku.lan: device or resource busy"}
Apr 18 13:43:15 tsundoku 796ywa0556grby68vb4p73mn5yn3l74x-unit-script-certmgr-pre-start[22733]: Failed: {"code":7400,"message":"failed POST to https://tsundoku.lan:8888/api/v1/cfssl/info: Post https://tsundoku.lan:8888/api/v1/cfssl/info: dial tcp: lookup tsundoku.lan: device or resource busy"}
Apr 18 13:43:15 tsundoku systemd[1]: certmgr.service: Control process exited, code=exited status=1
Apr 18 13:43:15 tsundoku systemd[1]: certmgr.service: Failed with result 'exit-code'.
Apr 18 13:43:15 tsundoku systemd[1]: Failed to start certmgr.

Has anyone figured out either a solution or a workaround for this at all?

I've been struggling to get a k8s cluster up on NixOS with this issue for a few days now. :(

The following seems to be the cause of the issue - a cert is being generated for "127.0.0.1" instead of "localhost" I guess?

# /nix/store/c1dcbf3c4jb4jlcadzh05i0di98lm6zz-unit-script-certmgr-pre-start
2019/04/18 21:38:39 [INFO] certmgr: loading from config file /nix/store/jvygi3li8pjmx0vf3jldamz8j3m1a03s-certmgr.yaml
2019/04/18 21:38:39 [INFO] manager: loading certificates from /nix/store/p4c93zcbdh9fcs634n1cn2scd5rwwjf0-certmgr.d
2019/04/18 21:38:39 [INFO] manager: loading spec from /nix/store/p4c93zcbdh9fcs634n1cn2scd5rwwjf0-certmgr.d/addonManager.json
2019/04/18 21:38:39 [ERROR] cert: failed to fetch remote CA: {"code":7400,"message":"failed POST to https://localhost:8888/api/v1/cfssl/info: Post https://localhost:8888/api/v1/cfssl/info: x509: certificate is valid for 127.0.0.1, not localhost"}
Failed: {"code":7400,"message":"failed POST to https://localhost:8888/api/v1/cfssl/info: Post https://localhost:8888/api/v1/cfssl/info: x509: certificate is valid for 127.0.0.1, not localhost"}

Seems there are multiple (possibly unrelated) issues being raised here. Will try to look into them individually tomorrow, if someone else doesn't beat me to it :-).

Regarding easyCerts: it seemed less intrusive not to enable that option by default, in order not to mess with the custom PKI setups of existing clusters. I personally still prefer that easyCerts is opt-in, not opt-out. I would have expected a build failure, though. IMHO, it is not nice that etcd fails at runtime because of a missing cert file.

@johanot Even when I turn on easyCerts though sadly it still fails. The main problem seems to be the following error:

x509: certificate is valid for 127.0.0.1, not localhost

Sounds like the cert is being generated for an IP and not a "domain" (kind of.)

I just went through setting up a kubernetes cluster on a new 19.03 install; it started with errors and then succeeded. My master node is defined with only the master role.

First, masterAddress was defined as an IP address and I got the error shown at the beginning of this issue, that is, no certificate got generated. When looking at the cfssl logs there were errors about a "bad certificate" and "cannot validate certificate for [IP] because it doesn't contain any IP SANs".

Then I changed the masterAddress to be a hostname and got the error "x509: certificate is valid for [ip] not [host]".

Then I:

  1. Removed all kubernetes references from the config, rebuilt and switched to config
  2. Deleted /var/lib/kubernetes/secrets and /var/lib/cfssl folders
  3. Readded kubernetes to the config with masterAddress as a hostname, rebuilt and switched to config

After that the certificates got generated and my kubernetes cluster seems to be running. I have also added another node to my cluster with role node through the nixos-kubernetes-node-join script.
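
For reference, a sketch of what the joining node's side might look like; the master hostname is a placeholder and easyCerts on the node is my assumption - the actual join token is handled by the nixos-kubernetes-node-join script, not by this config:

  # Sketch only; "kube-master.example" is a placeholder for the master's hostname,
  # and easyCerts here is an assumption, not taken from the comment above.
  services.kubernetes = {
    roles = ["node"];
    masterAddress = "kube-master.example";
    easyCerts = true;
  };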

@Icerius Thanks for the info - I think, however, that the primary issue being discussed here is running a master and node on the same box. That's where things seem to fall apart (i.e. when your masterAddress is localhost.)

I'll spend some time today looking into this and see if I can find a solution - this is really starting to bug me since I'm spending a lot of time with k8s lately and having it on my home server would be a helpful lab.

Thanks @cawilliamson for the reply.

Based on the original bug report from @apeyroux it does not seem to be the case that he is running master and node on the same box, since he only specifies the master role and he does not mention how he set masterAddress.

It is true that @zarelit seems to be running master and node on the same box with localhost as masterAddress.
A question for @zarelit: did you start your configuration with localhost, or did you first try with masterAddress = "tsundoku.lan"?
The reason I ask is that I see the message "certificate is valid for tsundoku.lan, not localhost", and when I got a similar message it was because the certificate that had been generated had incorrect info and I had to clean out the old certificates (the reason for step 2 in my description, which I was able to do since it was a clean setup).
I see you tried setting masterAddress to tsundoku.lan after having it set to localhost; what does tsundoku.lan resolve to on the machine?

Regarding the error you are seeing, @cawilliamson, it also seems to be because of incorrectly generated certificates. Did you start with masterAddress as localhost, or did you start with 127.0.0.1 and then move to localhost?

@Icerius I just spent some time on this and it turns out to be a very simple fix for me - I did start with "127.0.0.1" and switched to "localhost" (I didn't RTFM first!)

Anyway - the fix for me was to delete the cached certs, so basically:

1.) Disable kubernetes (remove refs from /etc/nixos/)
2.) rm -rf /var/lib/cfssl /var/lib/kubernetes
3.) Enable kubernetes again (add refs back to /etc/nixos/)
4.) If the first build fails, run the rebuild again and it should succeed the second time.

I have another problem now but that's unrelated so I'm all good on this one. :+1:

A question for @zarelit: did you start your configuration with localhost, or did you first try with masterAddress = "tsundoku.lan"?

At first I didn't read the docs and thought it was the address to bind to, so (IIRC, but I may be wrong) I put 127.0.0.1, then 0.0.0.0, then read the docs and put localhost, saw the messages, and put tsundoku.lan last.
At some point in these tests I rolled the nixos version back and forth to/from 18.09 to fix other unrelated issues (i.e. machine with swap enabled, disk pressure alert).

The reason I ask is that I see the message "certificate is valid for tsundoku.lan, not localhost" and when I got a similar message it was because the certificate that had been generated had incorrect info and I had to clean out the old certificates (the reason for step 2 in my description which I was able to do since it was a clean setup).

Yeah, I understand, but I was with a friend in an ongoing "live tests" frenzy ^_^'' so I don't recall the exact steps to reproduce. I'm going to clear the certificate cache and report back.

I see you tried setting masterAddress to tsundoku.lan after having it set to localhost; what does tsundoku.lan resolve to on the machine?

I thought I had tsundoku.lan in my /etc/hosts, but I actually don't, so what happened is that I tried to rebuild-switch both in a network that resolves it to my external address and in a network that does not resolve it at all.

@Icerius After clearing the data directories and rebuilding with masterAddress = "localhost"; I received an error about etcd again, but it was transient, and thus I believe it will be fixed when I update the channel, as @johanot pointed out.

I have the same issue and it is fatal. I find certmgr is looping on a "no IP SANs" failure.

My configuration is like this:

  services.kubernetes = {
    roles = ["master"];
    masterAddress = "10.0.5.2";
  };

Any idea to solve this problem?

@onixie masterAddress needs to be a hostname, not an IP.

@Gonzih
Thanks, a hostname works for me.
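
For reference, a sketch of the corrected shape of that configuration; the hostname and the /etc/hosts mapping are placeholders of mine, not taken from the comments above:

  # Sketch only; "kube-master.local" and the hosts entry are placeholder assumptions.
  networking.extraHosts = ''
    10.0.5.2 kube-master.local
  '';
  services.kubernetes = {
    roles = ["master"];
    masterAddress = "kube-master.local";
  };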

Thank you for your contributions.
This has been automatically marked as stale because it has had no activity for 180 days.
If this is still important to you, we ask that you leave a comment below. Your comment can be as simple as "still important to me". This lets people see that at least one person still cares about this. Someone will have to do this at most twice a year if there is no other activity.
Here are suggestions that might help resolve this more quickly:

  1. Search for maintainers and people that previously touched the
     related code and @ mention them in a comment.
  2. Ask on the NixOS Discourse.
  3. Ask on the #nixos channel on irc.freenode.net.

still important to me

I have a suspicion that #95885 (by bringing in https://github.com/golang/go/issues/39568) broke the NixOS kubernetes modules 22 days ago. It looks like the easyCerts automation is broken by the stricter cert verification logic in Go 1.15.

Yes, using nixpkgs from de5a644adf0ea226b475362cbe7e862789f2849d allows certmgr to talk to cfssl without errors.

The symptoms I've seen were:

  • certmgr showing this in logs:
Sep 11 16:26:11 luna certmgr-pre-start[2899]: 2020/09/11 16:26:11 [ERROR] cert: failed to fetch remote CA: {"code":7400,"message":"failed POST to https://api.kube:8888/api/v1/cfssl/info: Post \"https://api.kube:8888/api/v1/cfssl/info\": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0"}
Sep 11 16:26:11 luna certmgr-pre-start[2899]: Failed: {"code":7400,"message":"failed POST to https://api.kube:8888/api/v1/cfssl/info: Post \"https://api.kube:8888/api/v1/cfssl/info\": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0"}
  • cfssl showing this in logs:
Sep 11 16:32:19 luna cfssl[1734]: 2020/09/11 16:32:19 http: TLS handshake error from 192.168.1.21:33026: remote error: tls: bad>

My setup is following https://nixos.wiki/wiki/Kubernetes
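
As a possible stopgap (untested, and assuming the NixOS unit is simply named certmgr, as the logs suggest), the GODEBUG flag that the error message itself mentions could be set on the certmgr service; this only relaxes the client-side check and is not a fix for the easyCerts PKI:

  # Untested sketch: temporarily re-enable legacy Common Name matching for certmgr,
  # as hinted by the Go error message above. Not a real fix.
  systemd.services.certmgr.environment.GODEBUG = "x509ignoreCN=0";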

@gleber I believe it should be fixed once #96446 is merged.

@johanot I can confirm that it is fixed, but the NixOS-based tests using k8s that I've tried were flaky. It happened to me that kube-apiserver would get marked as failed after a couple of restarts through the StartLimitIntervalSec/StartLimitBurst mechanism. It would fail to start because the certificates were not yet present in the right locations (it looks like certmgr provisions them with a delay; I do not yet understand how it works). This would happen to me in 1/4 test runs when running the test from https://github.com/xtruder/kubenix/blob/master/examples/nginx-deployment/default.nix#L24
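
A sketch, untested, of loosening that start-rate limit so kube-apiserver keeps retrying while certmgr catches up; the option path is standard NixOS/systemd, but whether this is the right mitigation is my assumption:

  # Untested sketch: disable the start-rate limiting that marks kube-apiserver
  # as failed after a burst of restarts while certs are still being provisioned.
  systemd.services.kube-apiserver.unitConfig.StartLimitIntervalSec = 0;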

I'm still having this issue and haven't been able to fix it, really curious if there is any progress.

EDIT: I did what @cawilliamson suggested but had to rebuild three times to get it to work - I have no idea why it worked

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/use-nixos-as-single-node-kubernetes-cluster/8858/8

I had a lot of problems with this as well.

In https://discourse.nixos.org/t/use-nixos-as-single-node-kubernetes-cluster/8858/7?u=nobbz I learned that some files have to be deleted after a failed run. The list of those files comes from this thread basically.

Also, there I found out that the field masterAddress has to be a string describing the hostname; it seems an IP cannot be used here. Additionally, apiserver.advertiseAddress has to be an IP, not a hostname.

These are my observations. Not sure if changing those fields actually fixed it or was just coincidental, but after that it worked for me.
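
For reference, a sketch of a configuration matching those observations; the hostname and IP are placeholders of mine:

  # Sketch only; "kube.example" and 192.168.1.10 are placeholders.
  services.kubernetes = {
    roles = ["master" "node"];
    masterAddress = "kube.example";              # a hostname, not an IP
    apiserver.advertiseAddress = "192.168.1.10"; # an IP, not a hostname
    easyCerts = true;
  };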
