I am using etcd-operator 0.4.1 on K8S 1.6.7 to create a cluster with TLS. With etcd 3.1.8, the TLS certificates used for peer communication work without issue. Bumping the cluster to 3.2+ (3.2 and 3.2.3 tested) results in a failed cluster, and provisioning a new cluster on 3.2 fails with the same certificates. I generate the etcd certificates with the current method from terraform-installer.
The heart of the issue seems to be that 3.2 rejects the IP address of the connecting peer because it doesn't match any of the DNS names in the certificate. Since the IP addresses of the etcd members in a K8S deployment are unknown until provision time, the peer check seems overly restrictive; IP matching should only occur when the certificate explicitly provides IPs to validate. I modified my generation to not include IPs in the certificate and it still fails. When providing the exact IPs that will be used in the certificate it also fails because it still is looking at the DNS names instead of the IP addresses in the certificate.
2017-07-16 07:03:29.044100 I | etcdmain: rejected connection from "10.2.0.58:44986" (tls: "10.2.0.58" does not match any of DNSNames ["*.portworx-etcd.portworx.svc.cluster.local" "portworx-etcd-client.portworx.svc.cluster.local"])
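For anyone triaging similar logs, here is a minimal Python sketch that pulls the peer IP and the DNS names out of a rejection line like the one above. It is not part of etcd; `parse_rejection` and `REJECT_RE` are names made up for illustration, and the pattern assumes IPv4 peers and the exact message wording shown in this thread.

```python
import re

# Pattern for the rejection message quoted in this thread (IPv4 peers only).
REJECT_RE = re.compile(
    r'rejected connection from "(?P<ip>[^":]+):(?P<port>\d+)" '
    r'\(tls: "[^"]+" does not match any of DNSNames \[(?P<names>[^\]]*)\]'
)


def parse_rejection(line):
    """Return (peer_ip, [dns_names]) for a rejection line, or None if it does not match."""
    m = REJECT_RE.search(line)
    if m is None:
        return None
    # The DNS names are printed as space-separated quoted strings.
    names = re.findall(r'"([^"]+)"', m.group("names"))
    return m.group("ip"), names
```

Running this over a member's log gives the set of peer IPs that were turned away, which can then be compared with the addresses the member DNS names resolve to.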
> (3.2 and 3.2.3 tested)
Can you double-check the etcd version?
This seems to be a duplicate of https://github.com/coreos/etcd/issues/8206; the fix, https://github.com/coreos/etcd/pull/8223, is included in 3.2.3.
/cc @heyitsanthony
should have been fixed in 40468ab11f720e87ab853d05f6362f6f02c93689...
@cehoffman what does `openssl x509 -in peer.crt -text -noout` give for `X509v3 Subject Alternative Name:`?
I just double-checked the test and it is definitely using 3.2.3.
First member log
```
2017-07-18 17:10:38.470733 I | etcdmain: etcd Version: 3.2.3
2017-07-18 17:10:38.470792 I | etcdmain: Git SHA: ae23b0e
2017-07-18 17:10:38.470807 I | etcdmain: Go Version: go1.8.3
2017-07-18 17:10:38.470810 I | etcdmain: Go OS/Arch: linux/amd64
2017-07-18 17:10:38.470813 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2017-07-18 17:10:38.470841 I | embed: peerTLS: cert = /etc/etcdtls/member/peer-tls/peer.crt, key = /etc/etcdtls/member/peer-tls/peer.key, ca = , trusted-ca = /etc/etcdtls/member/peer-tls/peer-ca.crt, client-cert-auth = true
2017-07-18 17:10:38.471734 I | embed: listening for peers on https://0.0.0.0:2380
2017-07-18 17:10:38.471780 I | embed: listening for client requests on 0.0.0.0:2379
2017-07-18 17:10:38.487360 I | pkg/netutil: resolving portworx-etcd-0000.portworx-etcd.portworx.svc.cluster.local:2380 to 10.2.5.80:2380
2017-07-18 17:10:38.487463 I | pkg/netutil: resolving portworx-etcd-0000.portworx-etcd.portworx.svc.cluster.local:2380 to 10.2.5.80:2380
2017-07-18 17:10:38.487519 I | etcdserver: name = portworx-etcd-0000
2017-07-18 17:10:38.487545 I | etcdserver: data dir = /var/etcd/data
2017-07-18 17:10:38.487555 I | etcdserver: member dir = /var/etcd/data/member
2017-07-18 17:10:38.487558 I | etcdserver: heartbeat = 100ms
2017-07-18 17:10:38.487561 I | etcdserver: election = 1000ms
2017-07-18 17:10:38.487564 I | etcdserver: snapshot count = 100000
2017-07-18 17:10:38.487576 I | etcdserver: advertise client URLs = https://portworx-etcd-0000.portworx-etcd.portworx.svc.cluster.local:2379
2017-07-18 17:10:38.487581 I | etcdserver: initial advertise peer URLs = https://portworx-etcd-0000.portworx-etcd.portworx.svc.cluster.local:2380
2017-07-18 17:10:38.487588 I | etcdserver: initial cluster = portworx-etcd-0000=https://portworx-etcd-0000.portworx-etcd.portworx.svc.cluster.local:2380
2017-07-18 17:10:38.506153 I | etcdserver: starting member 76c1c84d62eb4d29 in cluster 95134594ee2c53b4
2017-07-18 17:10:38.506182 I | raft: 76c1c84d62eb4d29 became follower at term 0
2017-07-18 17:10:38.506190 I | raft: newRaft 76c1c84d62eb4d29 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2017-07-18 17:10:38.506194 I | raft: 76c1c84d62eb4d29 became follower at term 1
2017-07-18 17:10:38.523959 W | auth: simple token is not cryptographically signed
2017-07-18 17:10:38.534836 I | etcdserver: starting server... [version: 3.2.3, cluster version: to_be_decided]
2017-07-18 17:10:38.534878 I | embed: ClientTLS: cert = /etc/etcdtls/member/server-tls/server.crt, key = /etc/etcdtls/member/server-tls/server.key, ca = , trusted-ca = /etc/etcdtls/member/server-tls/server-ca.crt, client-cert-auth = true
2017-07-18 17:10:38.536391 I | etcdserver/membership: added member 76c1c84d62eb4d29 [https://portworx-etcd-0000.portworx-etcd.portworx.svc.cluster.local:2380] to cluster 95134594ee2c53b4
2017-07-18 17:10:39.206416 I | raft: 76c1c84d62eb4d29 is starting a new election at term 1
2017-07-18 17:10:39.206493 I | raft: 76c1c84d62eb4d29 became candidate at term 2
2017-07-18 17:10:39.206562 I | raft: 76c1c84d62eb4d29 received MsgVoteResp from 76c1c84d62eb4d29 at term 2
2017-07-18 17:10:39.206594 I | raft: 76c1c84d62eb4d29 became leader at term 2
2017-07-18 17:10:39.206606 I | raft: raft.node: 76c1c84d62eb4d29 elected leader 76c1c84d62eb4d29 at term 2
2017-07-18 17:10:39.206972 I | etcdserver: published {Name:portworx-etcd-0000 ClientURLs:[https://portworx-etcd-0000.portworx-etcd.portworx.svc.cluster.local:2379]} to cluster 95134594ee2c53b4
2017-07-18 17:10:39.207093 I | etcdserver: setting up the initial cluster version to 3.2
2017-07-18 17:10:39.207161 I | embed: ready to serve client requests
2017-07-18 17:10:39.207519 I | embed: serving client requests on [::]:2379
2017-07-18 17:10:39.212803 N | etcdserver/membership: set the initial cluster version to 3.2
2017-07-18 17:10:39.212907 I | etcdserver/api: enabled capabilities for version 3.2
2017-07-18 17:10:39.217143 I | etcdserver/api/v3rpc: Failed to dial 0.0.0.0:2379: connection error: desc = "transport: remote error: tls: bad certificate"; please retry.
2017-07-18 17:10:40.625575 I | etcdserver/membership: added member bca0434ec5dbf601 [https://portworx-etcd-0001.portworx-etcd.portworx.svc.cluster.local:2380] to cluster 95134594ee2c53b4
2017-07-18 17:10:40.625625 I | rafthttp: starting peer bca0434ec5dbf601...
2017-07-18 17:10:40.625640 I | rafthttp: started HTTP pipelining with peer bca0434ec5dbf601
2017-07-18 17:10:40.627070 I | rafthttp: started streaming with peer bca0434ec5dbf601 (writer)
2017-07-18 17:10:40.628238 I | rafthttp: started streaming with peer bca0434ec5dbf601 (writer)
2017-07-18 17:10:40.629623 I | rafthttp: started peer bca0434ec5dbf601
2017-07-18 17:10:40.629667 I | rafthttp: added peer bca0434ec5dbf601
2017-07-18 17:10:40.629952 I | rafthttp: started streaming with peer bca0434ec5dbf601 (stream MsgApp v2 reader)
2017-07-18 17:10:40.630166 I | rafthttp: started streaming with peer bca0434ec5dbf601 (stream Message reader)
2017-07-18 17:10:42.206589 W | raft: 76c1c84d62eb4d29 stepped down to follower since quorum is not active
2017-07-18 17:10:42.206634 I | raft: 76c1c84d62eb4d29 became follower at term 2
2017-07-18 17:10:42.206647 I | raft: raft.node: 76c1c84d62eb4d29 lost leader 76c1c84d62eb4d29 at term 2
2017-07-18 17:10:43.406452 I | raft: 76c1c84d62eb4d29 is starting a new election at term 2
2017-07-18 17:10:43.406573 I | raft: 76c1c84d62eb4d29 became candidate at term 3
2017-07-18 17:10:43.406597 I | raft: 76c1c84d62eb4d29 received MsgVoteResp from 76c1c84d62eb4d29 at term 3
2017-07-18 17:10:43.406628 I | raft: 76c1c84d62eb4d29 [logterm: 2, index: 5] sent MsgVote request to bca0434ec5dbf601 at term 3
2017-07-18 17:10:45.006451 I | raft: 76c1c84d62eb4d29 is starting a new election at term 3
2017-07-18 17:10:45.006478 I | raft: 76c1c84d62eb4d29 became candidate at term 4
2017-07-18 17:10:45.006487 I | raft: 76c1c84d62eb4d29 received MsgVoteResp from 76c1c84d62eb4d29 at term 4
2017-07-18 17:10:45.009878 I | raft: 76c1c84d62eb4d29 [logterm: 2, index: 5] sent MsgVote request to bca0434ec5dbf601 at term 4
2017-07-18 17:10:45.629933 W | rafthttp: health check for peer bca0434ec5dbf601 could not connect: dial tcp 10.2.2.77:2380: getsockopt: connection refused
2017-07-18 17:10:46.306461 I | raft: 76c1c84d62eb4d29 is starting a new election at term 4
2017-07-18 17:10:46.306489 I | raft: 76c1c84d62eb4d29 became candidate at term 5
2017-07-18 17:10:46.306498 I | raft: 76c1c84d62eb4d29 received MsgVoteResp from 76c1c84d62eb4d29 at term 5
2017-07-18 17:10:46.306540 I | raft: 76c1c84d62eb4d29 [logterm: 2, index: 5] sent MsgVote request to bca0434ec5dbf601 at term 5
2017-07-18 17:10:46.927812 I | etcdmain: rejected connection from "10.2.2.77:53656" (tls: "10.2.2.77" does not match any of DNSNames ["*.portworx-etcd.portworx.svc.cluster.local" "portworx-etcd-client.portworx.svc.cluster.local"])
2017-07-18 17:10:48.206455 I | raft: 76c1c84d62eb4d29 is starting a new election at term 5
2017-07-18 17:10:48.206503 I | raft: 76c1c84d62eb4d29 became candidate at term 6
2017-07-18 17:10:48.206513 I | raft: 76c1c84d62eb4d29 received MsgVoteResp from 76c1c84d62eb4d29 at term 6
2017-07-18 17:10:48.206581 I | raft: 76c1c84d62eb4d29 [logterm: 2, index: 5] sent MsgVote request to bca0434ec5dbf601 at term 6
2017-07-18 17:10:50.106471 I | raft: 76c1c84d62eb4d29 is starting a new election at term 6
2017-07-18 17:10:50.106566 I | raft: 76c1c84d62eb4d29 became candidate at term 7
2017-07-18 17:10:50.106604 I | raft: 76c1c84d62eb4d29 received MsgVoteResp from 76c1c84d62eb4d29 at term 7
2017-07-18 17:10:50.106635 I | raft: 76c1c84d62eb4d29 [logterm: 2, index: 5] sent MsgVote request to bca0434ec5dbf601 at term 7
2017-07-18 17:10:50.636397 W | rafthttp: health check for peer bca0434ec5dbf601 could not connect: dial tcp 10.2.2.77:2380: i/o timeout
2017-07-18 17:10:51.106761 I | raft: 76c1c84d62eb4d29 is starting a new election at term 7
2017-07-18 17:10:51.106808 I | raft: 76c1c84d62eb4d29 became candidate at term 8
2017-07-18 17:10:51.106825 I | raft: 76c1c84d62eb4d29 received MsgVoteResp from 76c1c84d62eb4d29 at term 8
2017-07-18 17:10:51.106840 I | raft: 76c1c84d62eb4d29 [logterm: 2, index: 5] sent MsgVote request to bca0434ec5dbf601 at term 8
2017-07-18 17:10:52.706798 I | raft: 76c1c84d62eb4d29 is starting a new election at term 8
2017-07-18 17:10:52.706847 I | raft: 76c1c84d62eb4d29 became candidate at term 9
2017-07-18 17:10:52.706858 I | raft: 76c1c84d62eb4d29 received MsgVoteResp from 76c1c84d62eb4d29 at term 9
2017-07-18 17:10:52.706866 I | raft: 76c1c84d62eb4d29 [logterm: 2, index: 5] sent MsgVote request to bca0434ec5dbf601 at term 9
2017-07-18 17:10:52.998857 W | etcdserver: timed out waiting for read index response
2017-07-18 17:10:54.206413 I | raft: 76c1c84d62eb4d29 is starting a new election at term 9
2017-07-18 17:10:54.206440 I | raft: 76c1c84d62eb4d29 became candidate at term 10
2017-07-18 17:10:54.206449 I | raft: 76c1c84d62eb4d29 received MsgVoteResp from 76c1c84d62eb4d29 at term 10
2017-07-18 17:10:54.206455 I | raft: 76c1c84d62eb4d29 [logterm: 2, index: 5] sent MsgVote request to bca0434ec5dbf601 at term 10
2017-07-18 17:10:55.636647 W | rafthttp: health check for peer bca0434ec5dbf601 could not connect: dial tcp 10.2.2.77:2380: i/o timeout
```
Second member log
```
2017-07-18 17:10:46.878003 I | etcdmain: etcd Version: 3.2.3
2017-07-18 17:10:46.878050 I | etcdmain: Git SHA: ae23b0e
2017-07-18 17:10:46.878054 I | etcdmain: Go Version: go1.8.3
2017-07-18 17:10:46.878056 I | etcdmain: Go OS/Arch: linux/amd64
2017-07-18 17:10:46.878060 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2017-07-18 17:10:46.878098 I | embed: peerTLS: cert = /etc/etcdtls/member/peer-tls/peer.crt, key = /etc/etcdtls/member/peer-tls/peer.key, ca = , trusted-ca = /etc/etcdtls/member/peer-tls/peer-ca.crt, client-cert-auth = true
2017-07-18 17:10:46.879156 I | embed: listening for peers on https://0.0.0.0:2380
2017-07-18 17:10:46.879239 I | embed: listening for client requests on 0.0.0.0:2379
2017-07-18 17:10:46.929853 W | etcdserver: could not get cluster response from https://portworx-etcd-0000.portworx-etcd.portworx.svc.cluster.local:2380: Get https://portworx-etcd-0000.portworx-etcd.portworx.svc.cluster.local:2380/members: EOF
2017-07-18 17:10:46.971235 I | etcdmain: rejected connection from "10.2.5.80:55984" (tls: "10.2.5.80" does not match any of DNSNames ["*.portworx-etcd.portworx.svc.cluster.local" "portworx-etcd-client.portworx.svc.cluster.local"])
2017-07-18 17:10:46.971269 I | etcdmain: rejected connection from "10.2.5.80:55982" (tls: "10.2.5.80" does not match any of DNSNames ["*.portworx-etcd.portworx.svc.cluster.local" "portworx-etcd-client.portworx.svc.cluster.local"])
2017-07-18 17:10:46.971279 C | etcdmain: cannot fetch cluster info from peer urls: could not retrieve cluster information from the given urls
```
`openssl x509 -in peer.crt -text -noout`
```
Certificate:
Data:
Version: 3 (0x2)
Serial Number:
bd:6f:a2:1d:b0:11:09:97:63:63:62:48:de:62:a7:05
Signature Algorithm: sha256WithRSAEncryption
Issuer: C=, ST=, L=/postalCode=, O=etcd, OU=, CN=etcd-ca
Validity
Not Before: Jul 16 07:01:14 2017 GMT
Not After : Aug 21 19:01:14 2017 GMT
Subject: C=, ST=, L=/postalCode=, O=etcd, OU=, CN=etcd
Subject Public Key Info:
Public Key Algorithm: rsaEncryption
RSA Public Key: (2048 bit)
Modulus (2048 bit):
00:ad:b6:8c:c9:2e:ac:fe:6d:99:83:e9:75:69:bb:
ff:2d:51:c6:d9:f8:db:0b:3a:30:03:4e:1b:fd:fd:
fb:46:4f:ed:d3:f9:12:6b:57:f1:96:64:a6:de:9f:
b0:ab:02:47:20:9e:8b:36:37:7b:a3:50:26:d5:c8:
db:73:9c:a5:91:be:4e:92:ca:08:9d:bd:99:1c:23:
54:b7:bd:f5:dc:c3:82:20:e4:86:b9:af:3c:5f:05:
2c:92:c2:9c:59:8a:79:ee:45:e1:c3:ee:61:2f:ff:
b0:b7:f0:2c:fb:c0:06:a7:e5:c5:70:23:95:bc:dc:
98:e4:45:72:5e:70:cb:33:6e:6f:cd:c2:3a:85:96:
5e:d9:37:1c:93:89:e4:3e:8d:4c:da:d7:53:53:2f:
ec:08:78:4a:84:bb:fb:37:87:2a:a9:0d:fd:20:50:
fc:f4:2d:c0:88:88:4b:1e:2b:74:f0:76:54:40:7c:
50:37:b6:da:62:6a:d9:e0:83:d9:f9:12:13:62:45:
67:5a:a0:90:8e:8d:0c:62:ae:48:34:25:3c:0f:6c:
9c:2d:13:f2:34:2d:d5:d8:ba:be:da:5d:57:53:a9:
1b:c0:6e:f8:84:c7:39:87:09:ff:bf:85:79:84:9c:
96:f1:bb:9e:b8:04:a0:08:27:60:26:26:65:0c:15:
73:e5
Exponent: 65537 (0x10001)
X509v3 extensions:
X509v3 Key Usage: critical
Key Encipherment
X509v3 Extended Key Usage:
TLS Web Server Authentication, TLS Web Client Authentication
X509v3 Basic Constraints: critical
CA:FALSE
X509v3 Authority Key Identifier:
keyid:E1:18:75:9D:86:18:FC:68:53:26:50:B3:7D:45:5C:37:C3:58:6E:E9
X509v3 Subject Alternative Name:
DNS:*.portworx-etcd.portworx.svc.cluster.local, DNS:portworx-etcd-client.portworx.svc.cluster.local
Signature Algorithm: sha256WithRSAEncryption
4a:a8:4c:2a:ac:f4:3d:51:f2:15:ca:19:2b:58:1a:b1:8a:07:
97:0f:80:6b:5f:47:0b:0b:cb:77:69:83:4d:1c:7b:b1:e0:a1:
42:c6:0a:0b:31:fc:49:ca:2d:49:56:6a:78:39:d9:ad:47:b5:
ce:bb:ac:e7:f3:bd:46:78:b7:ac:f4:52:61:22:cf:96:b0:36:
b0:e1:86:73:15:65:4a:82:7e:f6:a8:a3:c2:17:d3:ae:79:dc:
4b:9f:3a:79:6d:13:bc:dc:8f:ea:ba:b5:86:01:03:36:1d:33:
ea:79:c0:4a:8a:01:7d:18:63:6c:31:f9:ea:86:9b:03:2f:43:
b8:01:ab:12:fc:ee:f2:e2:75:fb:9d:fd:6b:af:33:cf:fe:fd:
5c:71:64:90:9e:d6:77:d1:2e:59:b8:95:3a:3c:be:12:fa:05:
8a:0a:41:3c:dd:93:37:ac:e7:f4:f5:7b:72:48:33:ca:00:0d:
db:e4:6d:82:f3:24:8e:ae:fb:ad:46:1d:db:8d:a3:64:4c:6d:
b1:0c:37:a8:ea:e4:99:ac:e0:d7:35:8c:b0:91:a6:e5:a6:af:
ed:25:48:58:83:e0:0a:e4:4b:a3:09:a8:e3:a5:ca:20:49:fd:
3a:65:1d:f4:56:99:ee:15:19:fb:ab:ab:09:74:90:5d:37:68:
21:d4:b8:83
```
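To see quickly whether a certificate carries DNS SANs, IP SANs, or both, the text dump above can be split mechanically. This is a minimal Python sketch that assumes the usual `openssl x509 -text` layout, where the SAN entries sit on the line after the `X509v3 Subject Alternative Name` header and IP entries are printed as `IP Address:...`. `parse_sans` is a hypothetical helper name.

```python
def parse_sans(openssl_text):
    """Split the SAN line of `openssl x509 -text` output into (dns_names, ip_addresses)."""
    lines = openssl_text.splitlines()
    for i, line in enumerate(lines):
        if "X509v3 Subject Alternative Name" in line and i + 1 < len(lines):
            # Entries are comma-separated on the line following the header.
            entries = [e.strip() for e in lines[i + 1].split(",")]
            dns = [e[len("DNS:"):] for e in entries if e.startswith("DNS:")]
            ips = [e[len("IP Address:"):] for e in entries if e.startswith("IP Address:")]
            return dns, ips
    # No SAN extension found in the dump.
    return [], []
```

For the certificate in this comment the second list comes back empty, meaning there is no IP SAN on it.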
> When providing the exact IPs that will be used in the certificate it also fails because it still is looking at the DNS names instead of the IP addresses in the certificate.
There's a DNS SAN on that cert but no IP SAN, so all it can do is check the DNS.
I've tried it with both DNS only and IP with DNS. This case uses only DNS because it was the last change I tried and it was working fine with 3.1.10, so I didn't switch it back. I used the same config to confirm the failure with 3.2.3. I can add IPs, but again IPs are not useful since I don't know for sure which IPs I'll have ahead of time when using etcd-operator. The cluster still fails, but the peer now rejects the connection from the initial member.
@cehoffman is the issue now that you want to disable SAN checking, or is it that etcd is failing to confirm the DNS SAN resolves to the incoming connection's address? What's in the DNS records?
I believe SAN checking when only DNS names are specified is the desired behavior. DNS resolution is working, and all members follow the generic name pattern `<cluster-name>-<number 0000-9999>.<cluster-name>.<k8s-namespace>.svc.cluster.local`. In this example, the DNS query for the first member returns:
portworx-etcd-0000.portworx-etcd.portworx.svc.cluster.local. 17 IN A 10.2.5.82
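To make the naming pattern concrete, here is a small Python sketch that builds a member name and checks it against a DNS SAN. The helper names `member_name` and `matches_san` are invented for illustration. The wildcard rule used here (a leading `*` covers exactly one label) is the common hostname-matching convention, and is an assumption rather than something taken from etcd's code.

```python
def member_name(cluster, index, namespace):
    """Build a member DNS name following the etcd-operator pattern described above."""
    return "%s-%04d.%s.%s.svc.cluster.local" % (cluster, index, cluster, namespace)


def matches_san(hostname, san):
    """Match a hostname against one DNS SAN; a leading '*' stands for exactly one label."""
    host_labels = hostname.lower().rstrip(".").split(".")
    san_labels = san.lower().rstrip(".").split(".")
    if len(host_labels) != len(san_labels):
        return False
    if san_labels[0] == "*":
        # Compare everything to the right of the wildcard label.
        return host_labels[1:] == san_labels[1:]
    return host_labels == san_labels
```

The member names match the wildcard SAN, while a bare IP such as `10.2.5.80` never can, which is consistent with the rejection messages in the logs.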
It seems to me that the etcd peer server is rejecting communication from peer clients because it always wants to validate the connecting client's IP against the peer certificate. Running in Kubernetes with etcd-operator, it only seems possible for the peer client to verify that it is talking to a valid peer server, and not for a peer server to validate the address of a peer client, unless the peer client also passes along the DNS name it can be reached at.
I'm having a similar issue with both 3.2.2 and 3.2.3. My cluster comes up healthy but when I kill an etcd pod and then a new one gets scheduled, I will see this error from all living peers and the new pod will just die because it can't resume state.
I am setting the env var `ETCD_INITIAL_CLUSTER_STATE` to `existing` when a pod gets rescheduled.
Is this issue definitely fixed?
I've just tried to update my cluster from v3.1.5 to v3.2.9, but after updating the first node, I received these same `tls: "X.X.X.X" does not match any of DNSNames [...` errors from the other peers, which refuse to talk to the upgraded node. Reverting to the previous version immediately fixes the issue.
The IP lookup seems to be based on PTR queries. If your PTR queries don't come up properly, this will fail. I'm seeing this in an environment where we don't manipulate PTR records.
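To illustrate why missing PTR records would produce exactly this failure, here is a rough Python sketch of a reverse-lookup based check. It is an approximation of the behaviour described in this comment, not etcd's implementation. `reverse_names` and `peer_ip_allowed` are hypothetical names, and the wildcard handling is a plain suffix match chosen for brevity.

```python
import socket


def reverse_names(ip):
    """PTR lookup through the system resolver; returns [] when no record exists."""
    try:
        name, aliases, _ = socket.gethostbyaddr(ip)
    except OSError:
        return []
    return [name] + list(aliases)


def peer_ip_allowed(ip, dns_sans, lookup=reverse_names):
    """Accept the peer only if a reverse-resolved name matches one of the DNS SANs."""
    for name in lookup(ip):
        name = name.rstrip(".").lower()
        for san in dns_sans:
            san = san.lower()
            if san.startswith("*."):
                # Suffix match: "*.example.local" accepts any name ending in ".example.local".
                if name.endswith(san[1:]):
                    return True
            elif name == san:
                return True
    return False
```

With no PTR record, `lookup` returns nothing and the peer is rejected no matter how correct the forward DNS records are. The `lookup` parameter exists only so the check can be exercised without a real resolver.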
@pires thanks for the reply, PTR records of course... should have seen that. I don't have those unfortunately, it's private address space in an AWS VPC.
Using only a PTR lookup seems a bit fragile. I would have thought doing a lookup on the wildcard entries from the certs would be better, since you'd only need this info when establishing connections for the first time with new nodes.
Maybe I'm missing something.
In the meantime I might have to disable peer cert auth if I want to upgrade to v3.2.x, which I'd really prefer not to.
Client auth should be fine. It's peer auth that's failing.
My bad s/client/peer/, I've been doing that all afternoon.
I'm having the same problems. I'm using a VPC in AWS, so I cannot create PTR records.
What are the alternatives? The only ones I see are to disable peer auth, or to not upgrade to 3.2.x at all (since it works with 3.1.x).
The behaviour I see is a bit weird. On a single-node etcd cluster, peer authentication is fine:
curl -k --key peer.key --cert peer.crt https://10.11.0.20:2380/members
{"id":9242707268920111284,"peerURLs":["https://10.11.0.20:2380"],"name":"bootstrap-etcd","clientURLs":["https://10.11.0.20:2379"]}
But right after a second node joins, the first node starts rejecting peers with the same certificate:
2017-10-24 15:18:38.285587 I | etcdmain: rejected connection from "172.18.12.5:37968" (tls: "172.18.12.5" does not match any of DNSNames ["*.kube-etcd.kube-system.svc" "*.eu-west-1.compute.internal" "localhost"])
The connection is made from a kube-hosted pod to a static pod on the host network, through a kube service.
etcd: 3.2.9
kube: 1.8.0
etcd-operator: 0.6.0