Cockroach: Health checks fail when using certificates (admin UI unusable)

Created on 22 Jul 2018 · 5 comments · Source: cockroachdb/cockroach

BUG REPORT

Please describe the issue you observed:

  • What did you do?

Ran cockroach start --cache=.25 --max-sql-memory=.25 after setting up certificates by following the tutorial in the Cockroach docs. The certificates work: I can log into the DB remotely and securely from my local client application and run queries. (This is a single-node cluster.)

Then I ran curl --insecure https://localhost:8080/health locally on the DB server.

  • What did you expect to see?

Something like this, which is what I get when I run cockroach start --insecure:

{
  "nodeId": 1,
  "address": {
    "networkField": "tcp",
    "addressField": "my-db:26257"
  },
  "buildInfo": {
    "goVersion": "go1.10",
    "tag": "v2.0.4",
    "time": "2018/07/16 20:25:32",
    "revision": "d7c99735249c963177eadbc5fd57fc2094de6823",
    "cgoCompiler": "gcc 6.3.0",
    "cgoTargetTriple": "x86_64-unknown-linux-gnu",
    "platform": "linux amd64",
    "distribution": "CCL",
    "type": "release",
    "channel": "official-binary",
    "dependencies": null
  }
}

  • What did you see instead?

{
  "error": "all SubConns are in TransientFailure",
  "code": 14
}

The admin UI loads, but all the AJAX requests respond with 503 across the board, so the UI is totally unusable.
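
The two response shapes above can be told apart mechanically. The following is a minimal sketch, not an official tool: `classify_health` is a hypothetical helper name, and it only does substring matching on the fields shown in this report ("error" and "nodeId") rather than real JSON parsing.

```shell
# classify_health BODY
# Prints "unhealthy" when the body carries an "error" field (as in the
# failure above), "healthy" when it reports a "nodeId", "unknown" otherwise.
# Assumption: substring matching is good enough for these two shapes.
classify_health() {
  body="$1"
  case "$body" in
    *'"error"'*)  echo "unhealthy" ;;
    *'"nodeId"'*) echo "healthy" ;;
    *)            echo "unknown" ;;
  esac
}

# Possible usage against the endpoint from the report (not run here):
#   classify_health "$(curl --insecure --silent https://localhost:8080/health)"
```

A check like this could be dropped into a monitoring script, though a proper JSON parser would be more robust.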

I just discovered that by running cockroach start with the --host flag, the error is resolved and the admin UI works as expected.

$ cockroach start --cache=.25 --max-sql-memory=.25 --host example.com --http-host localhost

^ This is what I'm doing now to work around the problem. However, I need to access the DB through both localhost and public interfaces, so using the public example.com interface/hostname for the client running on localhost isn't ideal.

Any suggestions?

Can you try specifying --advertise-host=example.com instead of --host=example.com?

--host determines which interfaces we're listening on. Defaults to hostname(), which does not always give you the right thing.
--advertise-host is the hostname gossiped to other nodes, and ultimately what will be used for inter-node communications, so it needs to be in the server certificate. Defaults to the value of --host.
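
Since the advertised name has to appear in the server certificate, it can help to check the certificate's Subject Alternative Name (SAN) entries directly. The sketch below is illustrative only: `san_has_host` is a hypothetical helper, it expects the SAN line in the form that `openssl x509 -text` prints, and it does exact matching (no wildcard handling). The path `certs/node.crt` in the usage comment is an assumption about where the node certificate lives.

```shell
# san_has_host SAN_TEXT HOST
# Succeeds (exit 0) if HOST appears as a DNS or IP entry in SAN_TEXT, where
# SAN_TEXT looks like the line openssl prints under
# "X509v3 Subject Alternative Name", for example:
#   DNS:localhost, DNS:example.com, IP Address:127.0.0.1
san_has_host() {
  san_text="$1"
  host="$2"
  # Split on commas, trim whitespace, then require an exact whole-line match.
  printf '%s\n' "$san_text" \
    | tr ',' '\n' \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
    | grep -qxF -e "DNS:$host" -e "IP Address:$host"
}

# Possible usage (assumed certificate path, not run here):
#   san="$(openssl x509 -in certs/node.crt -noout -text \
#           | grep -A1 'Subject Alternative Name' | tail -n 1)"
#   san_has_host "$san" example.com && echo "example.com is in the certificate"
```

If the name passed to --advertise-host is missing from that list, TLS hostname verification on node-to-node connections would be expected to fail.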

Oh my, you've done it. :) That does seem to fix it! I can now connect using localhost:26257 or example.com:26257 which is exactly what I was looking for. I'm not totally clear on why/how that works, but it does. I guess it's not a bug. Thanks!

Quick side-question: If I use the built-in client locally, say cockroach sql on the same machine, do I need to use TLS? I have TLS set up for my remote connection from my laptop, of course, but do you think, given this configuration, it is okay to disable TLS for localhost connections?

Glad to hear it. We're working on clarifying the use and meaning of --host and --advertise-host but here's the quick rundown for now:

  • --host tells the node which address to bind to. If left empty, it listens on all interfaces; otherwise it attempts to listen on the specified one.
  • --advertise-host is the name/address the node tells other nodes. All nodes need to be able to connect to all other nodes, and they use that advertised address to do it. This means that the address needs to be in the server certificate to set up the SSL connection. There are many configurations where the automatically determined value may not be the right one, or even if it is, it may not be the DNS/IP you specified in the certificates.

When you originally changed --host, that changed --advertise-host as well but restricted which interfaces the node listened on. By leaving --host empty and specifying --advertise-host, you control which address is used by other nodes but keep listening on all interfaces.
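
The interaction between the two flags can be written out as a toy model. This is only a sketch of the rules as described in this thread, not CockroachDB's actual code: `resolve_addrs` is a hypothetical function, and using `uname -n` to stand in for the automatically determined hostname is an assumption.

```shell
# resolve_addrs HOST ADVERTISE_HOST
# Toy model of the rules described in this thread:
#   listen address    = HOST, or all interfaces when HOST is empty
#   advertise address = ADVERTISE_HOST, else HOST, else the machine hostname
resolve_addrs() {
  host="$1"
  advertise="$2"
  if [ -n "$host" ]; then
    listen="$host"
  else
    listen="all interfaces"
  fi
  if [ -n "$advertise" ]; then
    advertised="$advertise"
  elif [ -n "$host" ]; then
    advertised="$host"
  else
    advertised="$(uname -n)"   # assumption: stands in for hostname()
  fi
  echo "listen=$listen advertise=$advertised"
}

resolve_addrs "" ""              # original report: advertises the machine hostname
resolve_addrs "example.com" ""   # prints: listen=example.com advertise=example.com
resolve_addrs "" "example.com"   # prints: listen=all interfaces advertise=example.com
```

The first call corresponds to the failing setup (the advertised name may not be in the certificate), the second to the --host workaround, and the third to setting only --advertise-host.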

Again, we're working on improving this as it's been a source of confusion for a lot of people, ourselves included.

Finally, the quick side-answer: TLS is currently an all-or-nothing kind of thing. The main reason is that you really don't want to allow any insecure connection when running in secure mode. TLS doesn't just provide transport encryption; it also provides authentication through client certificates. If TLS were disabled on localhost, anyone on localhost could just connect. Just because you're on the same machine doesn't mean you get to do anything you like. After all, a lot of people run bare metal with multiple users/jobs on the same machine. And even if you were OK with that kind of access, the reason we strongly discourage insecure mode is that inter-node communication is protected only by TLS, so turning it off means anyone can do anything using the inter-node API, even if you have a password at the SQL level.

Thank you, @mberhault! That is really helpful.

Cockroach is great so far, especially now with that cleared up. Getting ready for our first production deployment. Can't wait.
