Nomad v0.8.3 (c85483da3471f4bd3a7c3de112e95f551071769f)
Ubuntu 16.04
Consul v1.0.6
Vault v0.10.2
Since upgrading from v0.7.1 to v0.8.3, nomad client -> server RPC communication appears to error out continuously after I rotate my TLS certificates and send a SIGHUP to the nomad client PID. Additionally, all jobs with Vault integrations that are scheduled _after_ the SIGHUP has been sent fail with similar RPC EOF errors. I have been able to reproduce this in two different environments.
1) Deploy a cluster of nomad servers and nomad clients (>1 each).
2) Generate TLS certificates and enable TLS for all servers and clients
tls {
http = true
rpc = true
ca_file = "/usr/local/share/ca-certificates/vault.crt"
cert_file = "/secrets/nomad-client/server.crt"
key_file = "/secrets/nomad-client/server.key"
verify_server_hostname = true
}
3) Generate a new certificate for a nomad client (signed by the same CA as the previously used certificate) and overwrite cert_file and key_file with the new contents (don't append to the existing files, overwrite them completely). The previously used certificate should still be valid (not expired). See the sketch after this list for the commands involved.
4) Send the nomad client process a SIGHUP
/bin/kill -HUP $MAINPID
5) Validate that nomad client logs are throwing RPC errors when communicating to the nomad servers.
6) Schedule a job that uses the Vault integration, and validate that tasks are unable to create vault tokens due to similar (perhaps same) RPC EOF errors.
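Roughly, steps 3 and 4 look like the following in our environment. This is only a sketch: the PKI role, common_name, and use of jq/pidof are illustrative assumptions, while the cert_file/key_file paths are the ones from the tls block above.

# Issue a fresh certificate/key signed by the same CA (role and common_name are examples).
vault write -format=json pki/aws-us-west-2/issue/nomad-client common_name=nomad-client.node.consul ttl=720h > /tmp/nomad-client-cert.json

# Overwrite (do not append to) the files referenced by cert_file and key_file.
jq -r .data.certificate /tmp/nomad-client-cert.json > /secrets/nomad-client/server.crt
jq -r .data.private_key /tmp/nomad-client-cert.json > /secrets/nomad-client/server.key

# Tell the running nomad client to reload its TLS configuration
# (equivalent to the systemd ExecReload shown in step 4).
kill -HUP "$(pidof nomad)"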
Could not find anything relevant. The issue does not appear to affect the servers: I have performed the same steps against the nomad servers without any issues.
Jun 13 14:55:05 nomad-c-qa-1-181 systemd[1]: Reloading Nomad Agent.
Jun 13 14:55:05 nomad-c-qa-1-181 nomad[7171]: ==> Caught signal: hangup
Jun 13 14:55:05 nomad-c-qa-1-181 nomad[7171]: ==> Reloading configuration...
Jun 13 14:55:05 nomad-c-qa-1-181 systemd[1]: Reloaded Nomad Agent.
Jun 13 14:55:05 nomad-c-qa-1-181 nomad[7171]: 2018/06/13 14:55:05.909877 [ERR] nomad: "Node.GetClientAllocs" RPC failed to server 10.3.2.29:4647: rpc error: EOF
Jun 13 14:55:05 nomad-c-qa-1-181 nomad[7171]: 2018/06/13 14:55:05.909923 [ERR] client: failed to query for node allocations: rpc error: EOF
Jun 13 14:55:05 nomad-c-qa-1-181 nomad[7171]: nomad: "Node.GetClientAllocs" RPC failed to server 10.3.2.29:4647: rpc error: EOF
Jun 13 14:55:05 nomad-c-qa-1-181 nomad[7171]: client: failed to query for node allocations: rpc error: EOF
vault: failed to derive token: DeriveVaultToken RPC failed: rpc error: EOF
@SoMuchToGrok Can you provide sample certificates that demonstrate this behavior? Also, are you seeing the EOF errors forever after sending the signal, or just immediately after? We do close connections after reloading the certificates, so errors immediately after the reload would be expected behavior. Sustained errors would be bad though.
I'm seeing the EOF forever (well...at least 24+ hours and counting; going to assume it will keep going on). I'm additionally seeing secrets expire on the Vault side, but nomad is continuing to use them (presumably it'll try to keep using the secret even after it fails to communicate with Vault).
I'll work on generating some sample certificates, including the full chain, and will update when I have them.
Just chiming in: I believe we've encountered this issue on a Nomad 0.8.4/Ubuntu 16.04 cluster.
After updating certificates (issued via Let's Encrypt) and issuing a SIGHUP to reload the Nomad server instances, clients began reporting rpc error: EOF errors. We were able to hit the HTTP API of two of the three Nomad servers OK during this time, but the third returned the same "rpc error: EOF". This went on for at least fifteen minutes with no sign of recovery.
I'm not yet sure if it's relevant, but our tooling issued the SIGHUP against all three Nomad servers within about a second of one another.
We will try to establish a reproduction case and report back.
/edit restarting the Nomad server instances resolved the issue.
As a reproduction test case, I created and distributed self-signed certificates for every node (3x servers, 2x clients in our case) and issued a SIGHUP against all Nomad services.
After repeatedly swapping between Let's Encrypt and self-signed key material then triggering a reload, I've been unable to reproduce the scenario above. I'll keep trying.
Hello, I'm working with @SoMuchToGrok. Attached should be a full chain of certificates:
$ vault read -format=json pki/root/cert/ca | jq .data.certificate -r
-----BEGIN CERTIFICATE-----
MIIC+DCCAn6gAwIBAgIUJpGRsioeKF0pDWAsd0TcJ7CtmIowCgYIKoZIzj0EAwIw
EjEQMA4GA1UEAxMHcm9vdCBjYTAeFw0xODA2MjkxMjQ2NTlaFw0xODA3MzExMjQ3
MjlaMBIxEDAOBgNVBAMTB3Jvb3QgY2EwdjAQBgcqhkjOPQIBBgUrgQQAIgNiAAS8
iIHzsTHzBxkrOfv1FIRHVdk9g70IZi+JkcDz3z8pWMNoaGE4FTv2o5lg60cBsmbG
sGpHQdex9d+zxcHohX83vJHE1vlsJQyka8PzlmXMVdmXwi9Xfx0ewc7UJoAhNxSj
ggGTMIIBjzAOBgNVHQ8BAf8EBAMCAQYwEgYDVR0TAQH/BAgwBgEB/wIBATAdBgNV
HQ4EFgQUeUj2Bv5D5EBQ6KA4InKlp2u9/CQwHwYDVR0jBBgwFoAUeUj2Bv5D5EBQ
6KA4InKlp2u9/CQwgZoGCCsGAQUFBwEBBIGNMIGKMEoGCCsGAQUFBzAChj5odHRw
czovL3ZhdWx0LnNlcnZpY2UuYXdzLXVzLXdlc3QtMi5jb25zdWw6ODIwMC92MS9w
a2kvcm9vdC9jYTA8BggrBgEFBQcwAoYwaHR0cHM6Ly92YXVsdC5zZXJ2aWNlLmNv
bnN1bDo4MjAwL3YxL3BraS9yb290L2NhMIGLBgNVHR8EgYMwgYAwRaBDoEGGP2h0
dHBzOi8vdmF1bHQuc2VydmljZS5hd3MtdXMtd2VzdC0yLmNvbnN1bDo4MjAwL3Yx
L3BraS9yb290L2NybDA3oDWgM4YxaHR0cHM6Ly92YXVsdC5zZXJ2aWNlLmNvbnN1
bDo4MjAwL3YxL3BraS9yb290L2NybDAKBggqhkjOPQQDAgNoADBlAjEAyj6bhInj
EFaBz1JZ2FG4vGmAAMRivoZYhPdupkBISnLqa/+BeFHp0c3WGqDVlFAcAjBU/D0X
z4FB9JGGjmPChVevL98xToGJO9wHhT/fZoEqmTvtfoV9YcSLV8+lY/F3GaI=
-----END CERTIFICATE-----
$ vault write -format=json pki/aws-us-west-2/issue/nomad common_name=nomad-server.node.consul alt_names=alt_names=nomad-server.node.aws-us-west-2.consul,nomad-server.service.consul,nomad-server.service.aws-us-west-2.consul,server.global.nomad,localhost ip_sans=127.0.0.1 format=pem ttl=720h
{
"request_id": "2cfa5cf5-2406-e67d-ddb5-7c3e88ba2600",
"lease_id": "pki/aws-us-west-2/issue/nomad/62481b8f-3e31-8e06-4e69-a493ee27e635",
"lease_duration": 2591999,
"renewable": false,
"data": {
"ca_chain": [
"-----BEGIN CERTIFICATE-----\nMIIDAjCCAoigAwIBAgIUGtWhDkGL1G+IwUzyzwYly+kWCKswCgYIKoZIzj0EAwIw\nEjEQMA4GA1UEAxMHcm9vdCBjYTAeFw0xODA2MjkxMjQ4MTNaFw0xODA3MzExMjQ4\nNDNaMBwxGjAYBgNVBAMTEUlDQSBhd3MtdXMtd2VzdC0yMHYwEAYHKoZIzj0CAQYF\nK4EEACIDYgAEOpVvTNjk/uBz+lAFgmX1o9Xiwu8OGXuCxwBjLEyxQ2bHZfWmg8fL\nc6DXhccn4+7vDzznPvkuVY8mNxkF5IVu9n/9t5c1iDiRHwLtHPt2qkfbX/y/q/w0\ne2GSn7lJ0n54o4IBkzCCAY8wDgYDVR0PAQH/BAQDAgEGMBIGA1UdEwEB/wQIMAYB\nAf8CAQAwHQYDVR0OBBYEFPFzZlARXtd54Jrfih+KszJZik2jMB8GA1UdIwQYMBaA\nFHlI9gb+Q+RAUOigOCJypadrvfwkMIGaBggrBgEFBQcBAQSBjTCBijBKBggrBgEF\nBQcwAoY+aHR0cHM6Ly92YXVsdC5zZXJ2aWNlLmF3cy11cy13ZXN0LTIuY29uc3Vs\nOjgyMDAvdjEvcGtpL3Jvb3QvY2EwPAYIKwYBBQUHMAKGMGh0dHBzOi8vdmF1bHQu\nc2VydmljZS5jb25zdWw6ODIwMC92MS9wa2kvcm9vdC9jYTCBiwYDVR0fBIGDMIGA\nMEWgQ6BBhj9odHRwczovL3ZhdWx0LnNlcnZpY2UuYXdzLXVzLXdlc3QtMi5jb25z\ndWw6ODIwMC92MS9wa2kvcm9vdC9jcmwwN6A1oDOGMWh0dHBzOi8vdmF1bHQuc2Vy\ndmljZS5jb25zdWw6ODIwMC92MS9wa2kvcm9vdC9jcmwwCgYIKoZIzj0EAwIDaAAw\nZQIxAI0OJZ395FMk1WXoKrY4DpYHlvInJwzLuxUzvBBtkxPuPnAgFGGtHq0Z9kw+\nnwRUUgIwRrjtCZWSYOTqjNfEtm70WDdj7JSEOzw6T6T44lLvslMWuGGmMh1SioH8\nta4O7+wq\n-----END CERTIFICATE-----"
],
"certificate": "-----BEGIN CERTIFICATE-----\nMIIDtTCCAzqgAwIBAgIUc+s3scUkr8pvfaV3yUzAXFHBiqMwCgYIKoZIzj0EAwIw\nHDEaMBgGA1UEAxMRSUNBIGF3cy11cy13ZXN0LTIwHhcNMTgwNjI5MTI0OTEzWhcN\nMTgwNzI5MTI0OTQzWjAjMSEwHwYDVQQDExhub21hZC1zZXJ2ZXIubm9kZS5jb25z\ndWwwdjAQBgcqhkjOPQIBBgUrgQQAIgNiAAQRUB6RQBtV3pyRWgNpsVLvXOMeF7yz\ncxRuB9uwRheDK4znMVEdF3PY8YK18Dlkb6BeKVt4xhXSfZNISJXTwEqFzx+V8wWr\n/mOY1MUPH09Q+GmGFapMn8GLOebJU1Vic4KjggI0MIICMDAOBgNVHQ8BAf8EBAMC\nA6gwHQYDVR0lBBYwFAYIKwYBBQUHAwEGCCsGAQUFBwMCMB0GA1UdDgQWBBQr7X3A\nBELyx9saab7136CylBRxtjAfBgNVHSMEGDAWgBTxc2ZQEV7XeeCa34ofirMyWYpN\nozCBmgYIKwYBBQUHAQEEgY0wgYowSgYIKwYBBQUHMAKGPmh0dHBzOi8vdmF1bHQu\nc2VydmljZS5hd3MtdXMtd2VzdC0yLmNvbnN1bDo4MjAwL3YxL3BraS9yb290L2Nh\nMDwGCCsGAQUFBzAChjBodHRwczovL3ZhdWx0LnNlcnZpY2UuY29uc3VsOjgyMDAv\ndjEvcGtpL3Jvb3QvY2EwgZMGA1UdEQSBizCBiIIYbm9tYWQtc2VydmVyLm5vZGUu\nY29uc3Vsgglsb2NhbGhvc3SCKW5vbWFkLXNlcnZlci5zZXJ2aWNlLmF3cy11cy13\nZXN0LTIuY29uc3Vsghtub21hZC1zZXJ2ZXIuc2VydmljZS5jb25zdWyCE3NlcnZl\nci5nbG9iYWwubm9tYWSHBH8AAAEwgYsGA1UdHwSBgzCBgDBFoEOgQYY/aHR0cHM6\nLy92YXVsdC5zZXJ2aWNlLmF3cy11cy13ZXN0LTIuY29uc3VsOjgyMDAvdjEvcGtp\nL3Jvb3QvY3JsMDegNaAzhjFodHRwczovL3ZhdWx0LnNlcnZpY2UuY29uc3VsOjgy\nMDAvdjEvcGtpL3Jvb3QvY3JsMAoGCCqGSM49BAMCA2kAMGYCMQDHp+5xLJGo0kRp\nnkYtrVASlf/IB/NLLwCK3CyvCYboXM3erv286ZBW6a44AMGwtOICMQDEPNhJDq6n\nNJWXuKiQg1e68YG1Dpa63rKeiUQ+7Q4paO5Dp3aSJMF9BPDeKuz16K8=\n-----END CERTIFICATE-----",
"issuing_ca": "-----BEGIN CERTIFICATE-----\nMIIDAjCCAoigAwIBAgIUGtWhDkGL1G+IwUzyzwYly+kWCKswCgYIKoZIzj0EAwIw\nEjEQMA4GA1UEAxMHcm9vdCBjYTAeFw0xODA2MjkxMjQ4MTNaFw0xODA3MzExMjQ4\nNDNaMBwxGjAYBgNVBAMTEUlDQSBhd3MtdXMtd2VzdC0yMHYwEAYHKoZIzj0CAQYF\nK4EEACIDYgAEOpVvTNjk/uBz+lAFgmX1o9Xiwu8OGXuCxwBjLEyxQ2bHZfWmg8fL\nc6DXhccn4+7vDzznPvkuVY8mNxkF5IVu9n/9t5c1iDiRHwLtHPt2qkfbX/y/q/w0\ne2GSn7lJ0n54o4IBkzCCAY8wDgYDVR0PAQH/BAQDAgEGMBIGA1UdEwEB/wQIMAYB\nAf8CAQAwHQYDVR0OBBYEFPFzZlARXtd54Jrfih+KszJZik2jMB8GA1UdIwQYMBaA\nFHlI9gb+Q+RAUOigOCJypadrvfwkMIGaBggrBgEFBQcBAQSBjTCBijBKBggrBgEF\nBQcwAoY+aHR0cHM6Ly92YXVsdC5zZXJ2aWNlLmF3cy11cy13ZXN0LTIuY29uc3Vs\nOjgyMDAvdjEvcGtpL3Jvb3QvY2EwPAYIKwYBBQUHMAKGMGh0dHBzOi8vdmF1bHQu\nc2VydmljZS5jb25zdWw6ODIwMC92MS9wa2kvcm9vdC9jYTCBiwYDVR0fBIGDMIGA\nMEWgQ6BBhj9odHRwczovL3ZhdWx0LnNlcnZpY2UuYXdzLXVzLXdlc3QtMi5jb25z\ndWw6ODIwMC92MS9wa2kvcm9vdC9jcmwwN6A1oDOGMWh0dHBzOi8vdmF1bHQuc2Vy\ndmljZS5jb25zdWw6ODIwMC92MS9wa2kvcm9vdC9jcmwwCgYIKoZIzj0EAwIDaAAw\nZQIxAI0OJZ395FMk1WXoKrY4DpYHlvInJwzLuxUzvBBtkxPuPnAgFGGtHq0Z9kw+\nnwRUUgIwRrjtCZWSYOTqjNfEtm70WDdj7JSEOzw6T6T44lLvslMWuGGmMh1SioH8\nta4O7+wq\n-----END CERTIFICATE-----",
"private_key": "-----BEGIN EC PRIVATE KEY-----\nMIGkAgEBBDAVW+3UFVXYiGn5xpNsTh6qlJaaEylAK/RRrKNxGbEoTL2f8PraWxNB\nf52yLnChoe+gBwYFK4EEACKhZANiAAQRUB6RQBtV3pyRWgNpsVLvXOMeF7yzcxRu\nB9uwRheDK4znMVEdF3PY8YK18Dlkb6BeKVt4xhXSfZNISJXTwEqFzx+V8wWr/mOY\n1MUPH09Q+GmGFapMn8GLOebJU1Vic4I=\n-----END EC PRIVATE KEY-----",
"private_key_type": "ec",
"serial_number": "73:eb:37:b1:c5:24:af:ca:6f:7d:a5:77:c9:4c:c0:5c:51:c1:8a:a3"
},
"warnings": null
}
$ vault write -format=json pki/aws-us-west-2/issue/nomad-client common_name=nomad-client.node.consul alt_names=alt_names=nomad-client.node.aws-us-west-2.consul,nomad-client.service.consul,nomad-client.service.aws-us-west-2.consul,client.global.nomad,localhost ip_sans=127.0.0.1 format=pem ttl=720h
{
"request_id": "606ba187-fa58-59b5-6f96-f12c680d9d97",
"lease_id": "pki/aws-us-west-2/issue/nomad-client/03f67c56-1b51-7eed-b63b-3875fe3d3e49",
"lease_duration": 2591999,
"renewable": false,
"data": {
"ca_chain": [
"-----BEGIN CERTIFICATE-----\nMIIDAjCCAoigAwIBAgIUGtWhDkGL1G+IwUzyzwYly+kWCKswCgYIKoZIzj0EAwIw\nEjEQMA4GA1UEAxMHcm9vdCBjYTAeFw0xODA2MjkxMjQ4MTNaFw0xODA3MzExMjQ4\nNDNaMBwxGjAYBgNVBAMTEUlDQSBhd3MtdXMtd2VzdC0yMHYwEAYHKoZIzj0CAQYF\nK4EEACIDYgAEOpVvTNjk/uBz+lAFgmX1o9Xiwu8OGXuCxwBjLEyxQ2bHZfWmg8fL\nc6DXhccn4+7vDzznPvkuVY8mNxkF5IVu9n/9t5c1iDiRHwLtHPt2qkfbX/y/q/w0\ne2GSn7lJ0n54o4IBkzCCAY8wDgYDVR0PAQH/BAQDAgEGMBIGA1UdEwEB/wQIMAYB\nAf8CAQAwHQYDVR0OBBYEFPFzZlARXtd54Jrfih+KszJZik2jMB8GA1UdIwQYMBaA\nFHlI9gb+Q+RAUOigOCJypadrvfwkMIGaBggrBgEFBQcBAQSBjTCBijBKBggrBgEF\nBQcwAoY+aHR0cHM6Ly92YXVsdC5zZXJ2aWNlLmF3cy11cy13ZXN0LTIuY29uc3Vs\nOjgyMDAvdjEvcGtpL3Jvb3QvY2EwPAYIKwYBBQUHMAKGMGh0dHBzOi8vdmF1bHQu\nc2VydmljZS5jb25zdWw6ODIwMC92MS9wa2kvcm9vdC9jYTCBiwYDVR0fBIGDMIGA\nMEWgQ6BBhj9odHRwczovL3ZhdWx0LnNlcnZpY2UuYXdzLXVzLXdlc3QtMi5jb25z\ndWw6ODIwMC92MS9wa2kvcm9vdC9jcmwwN6A1oDOGMWh0dHBzOi8vdmF1bHQuc2Vy\ndmljZS5jb25zdWw6ODIwMC92MS9wa2kvcm9vdC9jcmwwCgYIKoZIzj0EAwIDaAAw\nZQIxAI0OJZ395FMk1WXoKrY4DpYHlvInJwzLuxUzvBBtkxPuPnAgFGGtHq0Z9kw+\nnwRUUgIwRrjtCZWSYOTqjNfEtm70WDdj7JSEOzw6T6T44lLvslMWuGGmMh1SioH8\nta4O7+wq\n-----END CERTIFICATE-----"
],
"certificate": "-----BEGIN CERTIFICATE-----\nMIIDtDCCAzqgAwIBAgIUI/Y+orYMNzY1fwQoXV3cY4lqn+cwCgYIKoZIzj0EAwIw\nHDEaMBgGA1UEAxMRSUNBIGF3cy11cy13ZXN0LTIwHhcNMTgwNjI5MTI1MDI1WhcN\nMTgwNzI5MTI1MDU1WjAjMSEwHwYDVQQDExhub21hZC1jbGllbnQubm9kZS5jb25z\ndWwwdjAQBgcqhkjOPQIBBgUrgQQAIgNiAATk5GvLPCFWDPsgf3RPGwJwuZPlxvdx\nyubVtQkJ89hm7l1wBB0E45egEdjrPr1SBTtfADHaLVwoVj3M9N1SezI8jVUMd1eA\ntU5F+auyxo6DwFxdJ+F5lGScknhzrn67LCyjggI0MIICMDAOBgNVHQ8BAf8EBAMC\nA6gwHQYDVR0lBBYwFAYIKwYBBQUHAwEGCCsGAQUFBwMCMB0GA1UdDgQWBBSqytOZ\nw3ucO3Y5OCUkAOjCY30KKDAfBgNVHSMEGDAWgBTxc2ZQEV7XeeCa34ofirMyWYpN\nozCBmgYIKwYBBQUHAQEEgY0wgYowSgYIKwYBBQUHMAKGPmh0dHBzOi8vdmF1bHQu\nc2VydmljZS5hd3MtdXMtd2VzdC0yLmNvbnN1bDo4MjAwL3YxL3BraS9yb290L2Nh\nMDwGCCsGAQUFBzAChjBodHRwczovL3ZhdWx0LnNlcnZpY2UuY29uc3VsOjgyMDAv\ndjEvcGtpL3Jvb3QvY2EwgZMGA1UdEQSBizCBiIIYbm9tYWQtY2xpZW50Lm5vZGUu\nY29uc3VsghNjbGllbnQuZ2xvYmFsLm5vbWFkgglsb2NhbGhvc3SCKW5vbWFkLWNs\naWVudC5zZXJ2aWNlLmF3cy11cy13ZXN0LTIuY29uc3Vsghtub21hZC1jbGllbnQu\nc2VydmljZS5jb25zdWyHBH8AAAEwgYsGA1UdHwSBgzCBgDBFoEOgQYY/aHR0cHM6\nLy92YXVsdC5zZXJ2aWNlLmF3cy11cy13ZXN0LTIuY29uc3VsOjgyMDAvdjEvcGtp\nL3Jvb3QvY3JsMDegNaAzhjFodHRwczovL3ZhdWx0LnNlcnZpY2UuY29uc3VsOjgy\nMDAvdjEvcGtpL3Jvb3QvY3JsMAoGCCqGSM49BAMCA2gAMGUCMAiR0BaODupT4anH\np+XmGqhaqPbfMFbqkGlPNdpsclAcZO8rF43JlzPvPP4tZJwCmAIxAIUCUvY3ajrb\nie+z13o1md7xfEs0ZuARBZvNJVcPaGFvI6+D00qbL1O5wazCR1ErlA==\n-----END CERTIFICATE-----",
"issuing_ca": "-----BEGIN CERTIFICATE-----\nMIIDAjCCAoigAwIBAgIUGtWhDkGL1G+IwUzyzwYly+kWCKswCgYIKoZIzj0EAwIw\nEjEQMA4GA1UEAxMHcm9vdCBjYTAeFw0xODA2MjkxMjQ4MTNaFw0xODA3MzExMjQ4\nNDNaMBwxGjAYBgNVBAMTEUlDQSBhd3MtdXMtd2VzdC0yMHYwEAYHKoZIzj0CAQYF\nK4EEACIDYgAEOpVvTNjk/uBz+lAFgmX1o9Xiwu8OGXuCxwBjLEyxQ2bHZfWmg8fL\nc6DXhccn4+7vDzznPvkuVY8mNxkF5IVu9n/9t5c1iDiRHwLtHPt2qkfbX/y/q/w0\ne2GSn7lJ0n54o4IBkzCCAY8wDgYDVR0PAQH/BAQDAgEGMBIGA1UdEwEB/wQIMAYB\nAf8CAQAwHQYDVR0OBBYEFPFzZlARXtd54Jrfih+KszJZik2jMB8GA1UdIwQYMBaA\nFHlI9gb+Q+RAUOigOCJypadrvfwkMIGaBggrBgEFBQcBAQSBjTCBijBKBggrBgEF\nBQcwAoY+aHR0cHM6Ly92YXVsdC5zZXJ2aWNlLmF3cy11cy13ZXN0LTIuY29uc3Vs\nOjgyMDAvdjEvcGtpL3Jvb3QvY2EwPAYIKwYBBQUHMAKGMGh0dHBzOi8vdmF1bHQu\nc2VydmljZS5jb25zdWw6ODIwMC92MS9wa2kvcm9vdC9jYTCBiwYDVR0fBIGDMIGA\nMEWgQ6BBhj9odHRwczovL3ZhdWx0LnNlcnZpY2UuYXdzLXVzLXdlc3QtMi5jb25z\ndWw6ODIwMC92MS9wa2kvcm9vdC9jcmwwN6A1oDOGMWh0dHBzOi8vdmF1bHQuc2Vy\ndmljZS5jb25zdWw6ODIwMC92MS9wa2kvcm9vdC9jcmwwCgYIKoZIzj0EAwIDaAAw\nZQIxAI0OJZ395FMk1WXoKrY4DpYHlvInJwzLuxUzvBBtkxPuPnAgFGGtHq0Z9kw+\nnwRUUgIwRrjtCZWSYOTqjNfEtm70WDdj7JSEOzw6T6T44lLvslMWuGGmMh1SioH8\nta4O7+wq\n-----END CERTIFICATE-----",
"private_key": "-----BEGIN EC PRIVATE KEY-----\nMIGkAgEBBDCB+6MFvnMGg9wqz5o1NMX+shLEi7eGoaDM6PnpitbPcs9/1GtW+hMu\nEIdrQo1joh6gBwYFK4EEACKhZANiAATk5GvLPCFWDPsgf3RPGwJwuZPlxvdxyubV\ntQkJ89hm7l1wBB0E45egEdjrPr1SBTtfADHaLVwoVj3M9N1SezI8jVUMd1eAtU5F\n+auyxo6DwFxdJ+F5lGScknhzrn67LCw=\n-----END EC PRIVATE KEY-----",
"private_key_type": "ec",
"serial_number": "23:f6:3e:a2:b6:0c:37:36:35:7f:04:28:5d:5d:dc:63:89:6a:9f:e7"
},
"warnings": null
}
Please let us know if there's any more information we can provide to help you with this.
Is there anything additional we can do to help with this issue? We're still stuck on Nomad v0.7.1 because of this bug.
I was able to reproduce with the following:
When you re-issue a cert and send a SIGHUP to the nomad client, everything is fine. It's when you re-issue a cert for the nomad server that the problem manifests itself.
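Roughly what I did to trigger it (a sketch only; the PKI role mirrors the one posted above, and the /etc/nomad/ paths are placeholders rather than my exact setup):

# Re-issue the *server* certificate from the same CA and overwrite the server's cert_file/key_file.
vault write -format=json pki/aws-us-west-2/issue/nomad common_name=nomad-server.node.consul alt_names=server.global.nomad,localhost ip_sans=127.0.0.1 ttl=720h > /tmp/nomad-server-cert.json
jq -r .data.certificate /tmp/nomad-server-cert.json > /etc/nomad/server.crt
jq -r .data.private_key /tmp/nomad-server-cert.json > /etc/nomad/server.key

# Reload the *server* agent; the errors below start right after this.
kill -HUP "$(pidof nomad)"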
You can see that after the SIGHUP is sent, the server logs a steady stream of messages like nomad.rpc: Closing server RPC connection:
==> Caught signal: hangup
==> Reloading configuration...
==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
2018/08/10 13:42:19.187842 [DEBUG] agent: starting reload of agent config
2018/08/10 13:42:19.193892 [DEBUG] agent: starting reload of server config
2018/08/10 13:42:19.194197 [INFO] nomad: reloading server connections due to configuration changes
2018/08/10 13:42:19.208107 [DEBUG] nomad: finished reloading server connections
2018/08/10 13:42:19.208131 [INFO] agent: Reloading HTTP server with new TLS configuration
2018/08/10 13:42:19.208138 [DEBUG] http: Shutting down http server
2018/08/10 13:42:19.460633 [DEBUG] vault: not renewing token as it is root
2018/08/10 13:42:20.312419 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:23.616124 [DEBUG] http: Request GET /v1/agent/health?type=server (154.333µs)
2018/08/10 13:42:25 [DEBUG] memberlist: TCP connection from=127.0.0.1:55900
2018/08/10 13:42:26.568915 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 0 services, 0 checks
2018/08/10 13:42:33.637483 [DEBUG] http: Request GET /v1/agent/health?type=server (139.611µs)
2018/08/10 13:42:35 [DEBUG] memberlist: TCP connection from=127.0.0.1:55905
2018/08/10 13:42:36.448273 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:37.799745 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:37.801752 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:39.377860 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:39.379074 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:40.879984 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:40.881732 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:42.812276 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:42.814450 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:43.658705 [DEBUG] http: Request GET /v1/agent/health?type=server (419.168µs)
2018/08/10 13:42:44.557962 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:44.560620 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:45 [DEBUG] memberlist: TCP connection from=127.0.0.1:55912
2018/08/10 13:42:45.579913 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:45.581623 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:46.446298 [WARN] nomad.heartbeat: node 'cb31c673-27e0-5bf1-9f88-ac352224c746' TTL expired
2018/08/10 13:42:46.959658 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:46.961002 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:48.302956 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:48.303873 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:49.746487 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:49.747652 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:51.271261 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:51.272460 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:53.268205 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:53.270470 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:53.680464 [DEBUG] http: Request GET /v1/agent/health?type=server (505.173µs)
2018/08/10 13:42:54.313170 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:54.314829 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:55 [DEBUG] memberlist: TCP connection from=127.0.0.1:55930
2018/08/10 13:42:55.500682 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:55.502501 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:56.572457 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 0 services, 0 checks
2018/08/10 13:42:57.299575 [INFO] nomad.rpc: Closing server RPC connection
2018/08/10 13:42:57.301394 [INFO] nomad.rpc: Closing server RPC connection
Shortly after the nomad server starts erroring, the client logs rpc errors over and over:
2018/08/10 13:42:36.446591 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
2018/08/10 13:42:36.446714 [ERR] client: heartbeating failed. Retrying in 1.350090439s: failed to update status: rpc error: EOF
2018/08/10 13:42:36.447385 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
2018/08/10 13:42:36.448472 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:
* rpc error: EOF
2018/08/10 13:42:37.800052 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
2018/08/10 13:42:37.800113 [ERR] client: heartbeating failed. Retrying in 1.573605339s: failed to update status: rpc error: EOF
2018/08/10 13:42:37.800912 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
2018/08/10 13:42:37.801971 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:
* rpc error: EOF
2018/08/10 13:42:39.377992 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
2018/08/10 13:42:39.378033 [ERR] client: heartbeating failed. Retrying in 1.499009466s: failed to update status: rpc error: EOF
2018/08/10 13:42:39.378474 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
2018/08/10 13:42:39.379204 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:
* rpc error: EOF
2018/08/10 13:42:40.880297 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
2018/08/10 13:42:40.880340 [ERR] client: heartbeating failed. Retrying in 1.928699781s: failed to update status: rpc error: EOF
2018/08/10 13:42:40.880918 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
2018/08/10 13:42:40.881891 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:
* rpc error: EOF
2018/08/10 13:42:42.812550 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
2018/08/10 13:42:42.812624 [ERR] client: heartbeating failed. Retrying in 1.739761915s: failed to update status: rpc error: EOF
2018/08/10 13:42:42.813176 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
2018/08/10 13:42:42.814840 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:
* rpc error: EOF
2018/08/10 13:42:43.149744 [DEBUG] http: Request GET /v1/agent/health?type=client (388.307µs)
2018/08/10 13:42:44.154786 http: TLS handshake error from 127.0.0.1:55911: tls: first record does not look like a TLS handshake
2018/08/10 13:42:44.558265 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
2018/08/10 13:42:44.558371 [ERR] client: heartbeating failed. Retrying in 1.017166125s: failed to update status: rpc error: EOF
2018/08/10 13:42:44.559671 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
2018/08/10 13:42:44.560783 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:
* rpc error: EOF
2018/08/10 13:42:45.580207 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
2018/08/10 13:42:45.580275 [ERR] client: heartbeating failed. Retrying in 1.377125573s: failed to update status: rpc error: EOF
2018/08/10 13:42:45.580856 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
2018/08/10 13:42:45.581789 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:
* rpc error: EOF
2018/08/10 13:42:46.959888 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
2018/08/10 13:42:46.959919 [ERR] client: heartbeating failed. Retrying in 1.33951467s: failed to update status: rpc error: EOF
2018/08/10 13:42:46.960373 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
2018/08/10 13:42:46.961139 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:
* rpc error: EOF
2018/08/10 13:42:48.303057 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
2018/08/10 13:42:48.303082 [ERR] client: heartbeating failed. Retrying in 1.439765118s: failed to update status: rpc error: EOF
2018/08/10 13:42:48.303428 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
2018/08/10 13:42:48.303961 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:
* rpc error: EOF
2018/08/10 13:42:49.746619 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
2018/08/10 13:42:49.746648 [ERR] client: heartbeating failed. Retrying in 1.519506656s: failed to update status: rpc error: EOF
2018/08/10 13:42:49.747089 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
2018/08/10 13:42:49.747776 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:
* rpc error: EOF
2018/08/10 13:42:51.271401 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
2018/08/10 13:42:51.271432 [ERR] client: heartbeating failed. Retrying in 1.991679597s: failed to update status: rpc error: EOF
2018/08/10 13:42:51.271868 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
2018/08/10 13:42:51.272580 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:
* rpc error: EOF
2018/08/10 13:42:53.171400 [DEBUG] http: Request GET /v1/agent/health?type=client (134.281µs)
2018/08/10 13:42:53.268541 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
2018/08/10 13:42:53.268637 [ERR] client: heartbeating failed. Retrying in 1.040439126s: failed to update status: rpc error: EOF
2018/08/10 13:42:53.269526 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
2018/08/10 13:42:53.270664 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:
* rpc error: EOF
2018/08/10 13:42:54.313479 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
2018/08/10 13:42:54.313526 [ERR] client: heartbeating failed. Retrying in 1.186429866s: failed to update status: rpc error: EOF
2018/08/10 13:42:54.314053 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
2018/08/10 13:42:54.314997 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:
* rpc error: EOF
2018/08/10 13:42:55.501162 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
2018/08/10 13:42:55.501230 [ERR] client: heartbeating failed. Retrying in 1.79393641s: failed to update status: rpc error: EOF
2018/08/10 13:42:55.501732 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
2018/08/10 13:42:55.502660 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:
* rpc error: EOF
2018/08/10 13:42:57.299842 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
2018/08/10 13:42:57.299965 [ERR] client: heartbeating failed. Retrying in 1.396287817s: failed to update status: rpc error: EOF
2018/08/10 13:42:57.300590 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
2018/08/10 13:42:57.301560 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:
* rpc error: EOF
2018/08/10 13:42:58.299391 [ERR] nomad: "Node.GetClientAllocs" RPC failed to server 127.0.0.1:4647: rpc error: EOF
2018/08/10 13:42:58.299421 [ERR] client: failed to query for node allocations: rpc error: EOF
2018/08/10 13:42:58.697862 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: failed to get conn: dial tcp 127.0.0.1:4647: connect: connection refused
2018/08/10 13:42:58.697926 [ERR] client: heartbeating failed. Retrying in 1.222660464s: failed to update status: rpc error: failed to get conn: dial tcp 127.0.0.1:4647: connect: connection refused
2018/08/10 13:42:58.698557 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
2018/08/10 13:42:58.699032 [ERR] client.consul: error discovering nomad servers: no Nomad Servers advertising service "nomad" in Consul datacenters: ["dc1"]
While this chaos is going down, if you try to do a nomad job status you get an error:
$ nomad job status
Error querying jobs: Get http://127.0.0.1:4646/v1/jobs: net/http: HTTP/1.x transport connection broken: malformed HTTP response "\x15\x03\x01\x00\x02\x02"
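(For what it's worth, the \x15\x03\x01 prefix is a TLS record, so this particular CLI error just means the CLI was speaking plain HTTP to the now TLS-enabled API. If it persists, pointing the CLI at HTTPS via Nomad's standard environment variables should get past it; the cert paths below are taken from the reporter's config and are only an example.)

export NOMAD_ADDR=https://127.0.0.1:4646
export NOMAD_CACERT=/usr/local/share/ca-certificates/vault.crt
export NOMAD_CLIENT_CERT=/secrets/nomad-client/server.crt
export NOMAD_CLIENT_KEY=/secrets/nomad-client/server.key
nomad job status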
Although if I let the server/client run for a few more minutes, the CLI will return data, but I will be unable to place new jobs (which is expected, I think):
$ nomad job run example-job.hcl
==> Monitoring evaluation "659d2fff"
Evaluation triggered by job "example"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "659d2fff" finished with status "complete" but failed to place all allocations:
Task Group "hello-world" (failed to place 1 allocation):
* No nodes were eligible for evaluation
* No nodes are available in datacenter "dc1"
Evaluation "d4012573" waiting for additional capacity to place remainder
I believe I found the issue! I will open a PR with a proposed fix - let me know if that will be acceptable!
@dadgar @cheeseprocedure