Vault: Continual TLS Handshake errors between vault nodes

Created on 10 Mar 2016  ·  16 Comments  ·  Source: hashicorp/vault

We have the setup below and are seeing a large volume of errors like the following in syslog. Every entry refers to one of the two Vault servers in the cluster. Any idea what the cause could be?

Mar 9 14:49:25 vault001 vault: 2016/03/09 14:49:25 http: TLS handshake error from 10.28.188.128:35548: tls: first record does not look like a TLS handshake

{
  "backend": {
    "consul": {
      "address": "127.0.0.1:8500",
      "path": "vault-aws",
      "advertise_addr": "https://vault001:8200"
    }
  },
  "listener": {
    "tcp": {
      "address": "10.28.188.127:8200",
      "tls_cert_file": "/etc/ssl/certs/vault.crt",
      "tls_key_file": "/etc/ssl/certs/vault.key"
    }
  },
  "listener": {
    "tcp": {
      "address": "127.0.0.1:8200",
      "tls_disable": 1
    }
  }
}

All 16 comments

Hi @nickwales,

What exactly is at 10.28.188.128?

10.28.188.128 is the other Vault server in the cluster of two. All the messages are similar; the source is either 10.28.188.128 or 10.28.188.127.

So it all _seems_ to be Vault internal communication, and it's happening several times a second. The cluster isn't being used for real yet, so nothing should be hitting it at that rate.

@nickwales Can you please cross check if you are using HTTPS instead of HTTP? This error is usually seen if TLS is enabled but HTTP is used to make the calls.

@nickwales In particular, your advertise_addr is using https but you are using HTTP, not HTTPS, for serving. This will certainly cause problems.

I have two listeners, one is on localhost and the other on the IP of the machine because SAN certs with IP addresses (in particular 127.0.0.1) are hard to come by from 3rd parties.

Should the listener on the IP address use

"address": "https://10.28.188.127:8200" ?

instead of

"address": "10.28.188.127:8200" ?

Can you explain the network setup at all? I don't know what vault001 resolves to, but my guess is that it's resolving to different things at different times to different hosts.

Also, what client(s) are you using? It seems like what's going on is that your client is making a non-TLS connection to localhost, being redirected, and then making a non-TLS connection to a TLS endpoint. That, or the client is simply making a non-TLS connection to the TLS endpoint from the get go.

vault001.domain.tld = 10.28.188.127
vault002.domain.tld = 10.28.188.128

These are actually FQDNs; I removed the domains for simplicity. The domain names are included in the SAN cert, and each server can make requests to the other without cert warnings.

Those names only resolve to those IPs; I checked.

It's a new cluster running 0.5.1. There are no clients other than the occasional test script and a Consul health check.

I can't even replicate this on a cluster I set up to try exactly that!

You mean, you cannot reproduce this behavior on a second cluster? :-/

Any info on the clients?

There are effectively no clients right now.

Another cluster set up with the same Puppet code doesn't log errors at this rate. The certs are specific to each cluster, so they differ.

I don't currently have any good ideas -- I still think it's something connecting to Vault using HTTP instead of HTTPS. Re-looking at your first log, it seems like something on vault002 is connecting to Vault on vault001 -- but Vault nodes do not talk to each other, the only coordination is via Consul, and Consul doesn't talk to Vault. Do you have any monitoring scripts or port scanners or health checks (including Consul health checks, which may not be configured to use HTTPS)?
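For illustration, here is a sketch of the kind of Consul check definitions that could produce these log lines. The original poster never showed their check, so everything here is hypothetical: the check ids, names, the `10s` interval, and the choice of `/v1/sys/health` as the path are assumptions (JSON has no comment syntax, so they are listed here rather than inline).

```json
{
  "checks": [
    {
      "id": "vault-plain-http",
      "name": "Vault health over plain HTTP",
      "http": "http://vault001:8200/v1/sys/health",
      "interval": "10s"
    },
    {
      "id": "vault-tcp",
      "name": "Vault port open",
      "tcp": "vault001:8200",
      "interval": "10s"
    }
  ]
}
```

Both checks target the TLS listener without speaking TLS. The first sends a cleartext HTTP request where Vault expects a TLS ClientHello, which matches the "first record does not look like a TLS handshake" message. The second only opens and closes the connection, which would more likely show up as a handshake error ending in "EOF". The exact wording can differ between Go versions.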

I had the same problem, and this issue doesn't look like a bug. I agree with @jefferai: perhaps you have a health check that uses HTTP instead of HTTPS. From your comments, your hostname resolves to the address of the HTTPS listener, so any TCP or HTTP Consul check against that address will produce this trace in the Vault log.

Keep in mind that this trace is only a warning, so your Vault service is still working fine.

In my case, I had a Consul HTTP health check running, and that is why the TLS handshake error happened. I debugged the Vault code and found no internal connections to or from Vault over HTTP, so I can say that my problem was the Consul health check, not Vault.

I hope this information will be useful for other cases, so I describe my setup below.

  • My Vault configuration file is shown below:
backend "zookeeper" {
   address = "lab1.mydomain.com:2181,lab2.mydomain.com:2181,lab3.mydomain.com:2181"
   path = "vault/"
   advertise_addr = "https://lab2.mydomain.com:8200"
   znode_owner = "digest:user:tpUq/4Pn5A64fVZyQ0gOJ8ZWqkY="
   auth_info = "digest:user:password"
}

listener "tcp" {
   address = "0.0.0.0:8200"
   tls_cert_file = "/opt/mydomain/vault/conf/secrets/vault.crt"
   tls_key_file = "/opt/mydomain/vault/conf/secrets/vault.key"
}
  • Both the DNS and the x509 certificate include the "lab2.mydomain.com" hostname, for DNS resolution and TLS verification respectively.
  • Finally, my Vault service only shows this trace while the Consul health check is running.
● vault.service - Vault service
   Loaded: loaded (/etc/systemd/system/vault.service; disabled; vendor preset: disabled)
   Active: active (running) since Tue 2016-10-25 14:14:22 UTC; 34min ago
 Main PID: 19963 (vault)
   CGroup: /system.slice/vault.service
           └─19963 /opt/mydomain/vault/bin/vault server -config=/opt/mydomain/vault/conf/config.hcl

Oct 25 14:47:04 lab2 vault[19963]: http: TLS handshake error from 10.200.1.33:42083: EOF
Oct 25 14:47:14 lab2 vault[19963]: http: TLS handshake error from 10.200.1.33:42085: EOF
Oct 25 14:47:24 lab2 vault[19963]: http: TLS handshake error from 10.200.1.33:42087: EOF
Oct 25 14:47:34 lab2 vault[19963]: http: TLS handshake error from 10.200.1.33:42089: EOF
Oct 25 14:47:44 lab2 vault[19963]: http: TLS handshake error from 10.200.1.33:42091: EOF
Oct 25 14:47:54 lab2 vault[19963]: http: TLS handshake error from 10.200.1.33:42093: EOF
Oct 25 14:48:04 lab2 vault[19963]: http: TLS handshake error from 10.200.1.33:42096: EOF
Oct 25 14:48:14 lab2 vault[19963]: http: TLS handshake error from 10.200.1.33:42098: EOF
Oct 25 14:48:24 lab2 vault[19963]: http: TLS handshake error from 10.200.1.33:42100: EOF
Oct 25 14:48:34 lab2 vault[19963]: http: TLS handshake error from 10.200.1.33:42102: EOF
  • I fixed this problem with a Consul script check that uses HTTPS instead of a TCP/HTTP health check.
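The commenter did not post the check itself, so the following is only a sketch of one way to make a Consul check speak HTTPS to Vault. It uses Consul's built-in HTTP check with an `https://` URL rather than a script. The check id, name, `timeout`, and the `/v1/sys/health` path are assumptions. The hostname and port come from the configuration above, and the `10s` interval mirrors the spacing of the log lines.

```json
{
  "check": {
    "id": "vault-https",
    "name": "Vault health over HTTPS",
    "http": "https://lab2.mydomain.com:8200/v1/sys/health",
    "interval": "10s",
    "timeout": "5s"
  }
}
```

Because the check now completes a TLS handshake, the warning should stop appearing. Two caveats apply. The Consul agent has to trust the CA that signed the Vault certificate (newer Consul releases also offer a `tls_skip_verify` option for checks). And Vault's health endpoint answers 429 on a standby node by default, which Consul reports as a warning rather than passing.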

@ghost
Thanks mate, your description helped me to solve my problem.
I was using TCP health-check instead of HTTPS.

@den-is @ChrisMacNaughton How did you solve this? Can you please explain and give an example of what you mean by "I fixed this problem with a consul script that run HTTPS instead of TCP/HTTP health check."? I'm reading this post for the fifth time and I still can't understand how.

@innovia You should use a load balancer (AWS ELB in my case) that provides HTTPS offloading and sends proper HTTP headers like X-Forwarded-For.
I was using a simple TCP load balancer, which is not suitable for HTTP load balancing or proxying, so the Vault servers themselves didn't know how to properly handle the TLS connection with the client (me).

@den-is thank you so much!

I'm using Kubernetes, which has health checks and readiness probes. These checks were TCP, like you said, so I've changed them to HTTPS.
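The actual manifest was not posted, so this is only a guess at what such a change might look like: a container spec fragment in which the probe uses `httpGet` with `"scheme": "HTTPS"` instead of `tcpSocket`. It is written as JSON, which Kubernetes accepts alongside YAML. The container name, port, path, and timing values are all hypothetical.

```json
{
  "name": "vault",
  "readinessProbe": {
    "httpGet": {
      "path": "/v1/sys/health",
      "port": 8200,
      "scheme": "HTTPS"
    },
    "initialDelaySeconds": 5,
    "periodSeconds": 10
  }
}
```

The kubelet skips certificate verification for HTTPS probes, so no CA setup is needed for the probe itself. It does treat only status codes from 200 to 399 as success, so a standby or sealed Vault node would be reported as not ready with this path. Vault's health endpoint has query parameters such as `standbyok` that may help if that is not the desired behavior.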

Glad to hear it's all working.
