Grpc-go: grpc.Dial does not return an error on bad certificates

Created on 31 Mar 2016 · 25Comments · Source: grpc/grpc-go

When using mutual-TLS with mismatched certificates, calling grpc.Dial(..., grpc.WithBlock()) never returns. The client connection is happily banging its head against bad certificate errors and never reports an error to the caller.

Worse, it seems that WithBlock() is extremely unreliable. It blocks until the connection is connected, but rarely returns a non-nil error to the caller in the error case (connection refused is one case where it does the right thing).

Source

tamird

Most helpful comment

There are some cases (like ours) where there is no load balancer in the
picture. There should be _some_ way of treating this error as non
transient, even if it's not the default behaviour.

On Thu, Mar 31, 2016 at 10:03 PM, Qi Zhao [email protected] wrote:

I can imagine that in some cases the name resolver or load balancer may
direct you to another server which can accepts the cert. I am still
debating with myself whether I should introduce some more tweaks into the
connection errors.

—
You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub
https://github.com/grpc/grpc-go/issues/622#issuecomment-204207188

tamird on 1 Apr 2016

👍3

All 25 comments

Connection error is treated as transient so that grpc internal keeps retrying until you set a timeout. I expect you can see something from log with mismatched certificates.

iamqizhao on 1 Apr 2016

s/until/unless

iamqizhao on 1 Apr 2016

I could see the argument that cert errors are not transient, though. There's no point in retrying because it's expected to fail consistently.

dsymonds on 1 Apr 2016

I can imagine that in some cases the name resolver or load balancer may direct you to another server which can accepts the cert. I am still debating with myself whether I should introduce some more tweaks into the connection errors.

iamqizhao on 1 Apr 2016

There are some cases (like ours) where there is no load balancer in the
picture. There should be _some_ way of treating this error as non
transient, even if it's not the default behaviour.

On Thu, Mar 31, 2016 at 10:03 PM, Qi Zhao [email protected] wrote:

I can imagine that in some cases the name resolver or load balancer may
direct you to another server which can accepts the cert. I am still
debating with myself whether I should introduce some more tweaks into the
connection errors.

—
You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub
https://github.com/grpc/grpc-go/issues/622#issuecomment-204207188

tamird on 1 Apr 2016

👍3

Let me talk to our security guru to decide whether security handshake error should be treated as a fatal error first.

iamqizhao on 1 Apr 2016

After the discussion, we will treat this as a fatal error. A pull request will be out to address this soon.

iamqizhao on 6 Apr 2016

@iamqizhao ping

tamird on 4 May 2016

@iamqizhao what's up with this?

tamird on 27 Jun 2016

This is a pain point for me as well. There doesn't seem to be a way to surface a TLS-related error to the calling code. It only goes to the log. This means an application where the user is not monitoring the logs directly can't signal the error condition.

aaronlehmann on 15 Jul 2016

@menghanl is working on a PR to improve this. It should be out soon.

iamqizhao on 15 Jul 2016

@iamqizhao Are there plans to handle this for use cases outside of WithBlock? Seems odd that we would just let an RPC method call hang forever. In general, it seems best to avoid WithBlock but there is no way to boil up errors on synchronous operations.

stevvooe on 16 Jul 2016

@stevvooe The thing becomes a bit tricky when there may be multiple underlying network connections. @menghanl will revise the PR (or in another separate PR) so that if WithBlock is not set:
i) If the rpc is fail-fast, it fails immediately with the fatal certificate error;
ii) if the rpc is not fail-fast, the rpcs keeps waiting since the next address(i.e., a new server) from the name resolver/balancer might save the world.

iamqizhao on 18 Jul 2016

ii) is bogus. On a bad certificate, the chance that the next RPC succeeds
is approximately zero.

On Jul 18, 2016 14:05, "Qi Zhao" [email protected] wrote:

@stevvooe https://github.com/stevvooe The thing becomes a bit tricky
when there may be multiple underlying network connections. @menghanl
https://github.com/menghanl will revise the PR (or in another separate
PR) so that if WithBlock is not set:
i) If the rpc is fail-fast, it fails immediately with the fatal
certificate error;
ii) if the rpc is not fail-fast, the rpcs keeps waiting since the next
address(i.e., a new server) from the name resolver/balancer might save the
world.

—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
https://github.com/grpc/grpc-go/issues/622#issuecomment-233408934, or mute
the thread
https://github.com/notifications/unsubscribe-auth/ABdsPINuMImE5G5ZJzZYUYt3TPeuuwk5ks5qW8BRgaJpZM4H8-3F
.

tamird on 18 Jul 2016

@tamird I think typically this kind of error is on the server side and there is some chance a newly started server fixing the problem. Technically, a non-fail-fast rpc never fails with a single connection error regardless it is fatal or not.

iamqizhao on 18 Jul 2016

It's not true that the problem is typically on the server. If the server
rejects the client's certificate, it is almost certainly true that the
client is at fault, and the RPC will never succeed.

On Jul 18, 2016 14:36, "Qi Zhao" [email protected] wrote:

@tamird https://github.com/tamird I think typically this kind of error
is on the server side and there is some chance a newly started server
fixing the problem. Technically, a non-fail-fast rpc never fails with a
single connection error regardless it is fatal or not.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/grpc/grpc-go/issues/622#issuecomment-233418002, or mute
the thread
https://github.com/notifications/unsubscribe-auth/ABdsPOSgAK1wPahOWsMKLGcJHsLwpZukks5qW8eSgaJpZM4H8-3F
.

tamird on 18 Jul 2016

👍1

The thing becomes a bit tricky when there may be multiple underlying network connections.

I get that.

Typically, this is handled by separating the connection-side from the RPC-side, which isn't really done in grpc-go. One can differentiate between a failure or series of failures on Dial and a single failure on a single RPC.

Let me ask this another way: how should we surface these kinds of errors to a user? How should we detect this particular state and report it?

stevvooe on 18 Jul 2016

@tamird In most of use case I have seen the client simply uses the machine CA (credentials.NewClientTLSFromCert(nil, "")) which has slim chance to get wrong compared to the server side.

iamqizhao on 18 Jul 2016

@iamqizhao

@tamird In most of use case I have seen the client simply uses the machine CA (credentials.NewClientTLSFromCert(nil, "")) which has slim chance to get wrong compared to the server side.

What about mtls? What about the distribution of self-signed certificates or certificates outside a machine's root CA?

In practice, I see people getting this wrong all the time.

Most troubleshooting I have seen focuses on getting the right certificates client-side to operate against a fixed server configuration. While this won't be a GRPC problem initially, it will become one as the adoption of GRPC and http/2 servers increase.

There needs to be good error messaging so that one can observe and address the problem.

stevvooe on 18 Jul 2016

“Seems odd that we would just let an RPC method call hang forever.”

Why? This is Go. If you need it to be asynchronous, you can just go func() { }() it.

A basic design principle in Go is that one should write ones API to be synchronous. Because, if the programmer needs asynchronous, then they can generally trivially make it asynchronous.

puellanivis on 18 Jul 2016

@puellanivis In that basic model, two bad things happen:

Errors are obscured from the "users" (maybe the caller).
Resources are leaked when callers go away.

In short, that analogy doesn't really apply here. There are lots of resources that may be locked under an RPC call. Simply putting the call into a goroutine then going away if it doesn't return in time will leak those resources (not to mention, the goroutine doing the call). We encapsulate this problem using Context and WithTimeout, deferring that functionality to grpc-go.

We can see this break down when there are multiple causal failures. Let's say we have a method that we'd like to attempt to call and return to a user in 1s. In the course of that 1s call, tens of TLS connection attempts may have failed. All the user will see in the current model is the timeout error, obscuring the root of the problem.

Effectively, we have Context to deal with problem 2 but the solution to problem 1 is not addressed.

stevvooe on 18 Jul 2016

@stevvooe Understood. To be clear, I said "most of use cases I have seen" instead of all the cases.

For non-fail-fast rpc and multiple underlying network connections, this is a complex mess even for Google internal rpc systems for years. The workaround we always suggested to our prod users is setting a deadline to the rpcs and checking the error log. This is not a good solution and we should improve. My thinking is that when a non-fail-fast rpc fails, we can put the real reason into its description so the the caller can take action. In addition, we can even add some grpc.IsErrorRetryable(...) to detect if the returned rpcError can be retried (e.g., in this case, it returns false). This deserves a good design.

As a short term goal, we can fix it when Balancer is not in the picture -- there is only a single destination address for both failfast and non-failfast rpcs. This will fix @tamird 's concern at least.

iamqizhao on 18 Jul 2016

Seems like a reasonable argument. I'm just noting that it should _not_ sound “odd” to do that. (omg, stupid typos. Sorry for the spam in correcting this tiny little comment.)

puellanivis on 18 Jul 2016

@iamqizhao Sounds good.

I would suggest adopting the Temporary convention used throughout the Go community.

stevvooe on 18 Jul 2016

This issue came up in discussions about issue #976.

In light of that issue, we've been re-evaluating the way DialContext establishes connections with targets returned by the balancer. We would like to unify the approach for this with the ongoing balancer watcher, rather than having two separate methods of connecting. PR #1112 implements this, but one notable side effect is that we will no longer return a specific connection error from a blocking DialContext -- only a timeout from the context would occur. We believe this is reasonable behavior and that returning connection errors from Dial is problematic for two reasons: 1. ongoing balancer updates could result in new connection failures, of which the client is never appraised, and 2. it is difficult to reasonably define how Dial should behave in the event of multiple errors, or with both errors and successes. The current behavior fails immediately upon any error connecting to any target; we are sure this is undesirable -- other targets could succeed, or may already have. With PR #1112, we will unblock after the first successful connection is made or the timeout elapses. The behavior of Dial to a single target server, not via a balancer, is unchanged.

Please update this issue if you have concerns about this change.