Hi, we just pushed the code to production, and one hour after started seeing unexpected UnauthenticatedError exceptions.
cause is GRPC::BadStatus and message is "16:Faraday::ConnectionFailed", .details="Faraday::ConnectionFailed"
Which looks like transient HTTP AUTH connection failure, which should be retriable, and in any case should not be reported as Google::Cloud::UnauthenticatedError.
Also I didn't think Faraday is still in use after update to grpc.
Can you please investigate the issue, as it is out of the expected behavior and I had to hot-patch the production.
token refresh?
Update: looks like token refresh indeed. Problems started exactly (within couple of seconds) 1 hour after the corresponding process restart on multiple Heroku dynos.
Indeed, UNAUTHENTICATED = 16 in google/rpc/code.proto.
cc @murgatroid99 Any thoughts on this behavior?
The Google Auth library has a dependency on Faraday, so it might be something related to that.
Btw, we have not seen this problem running on staging setup. Must be related to traffic somehow
You don't happen to have the backtraces, do you?
@blowmage I am also trying to understand why Faraday is present. Is it loaded here by google-cloud-core, do you think?
Google Auth is built on top of Signet, and Signet is built on top of Faraday. When the authentication token is refreshed it makes a call to a Google service to get a new hash. Those calls are all made through Faraday. The actual API calls to Datastore are all through GRPC, using the authentication hash from the token object.
@blowmage I have a stacktrace for top-level exception which is not very helpful:
/app/vendor/bundle/ruby/2.3.0/gems/google-cloud-datastore-0.20.1/lib/google/cloud/datastore/service.rb:154:in `rescue in execute'
/app/vendor/bundle/ruby/2.3.0/gems/google-cloud-datastore-0.20.1/lib/google/cloud/datastore/service.rb:150:in `execute'
/app/vendor/bundle/ruby/2.3.0/gems/google-cloud-datastore-0.20.1/lib/google/cloud/datastore/service.rb:83:in `lookup'
/app/vendor/bundle/ruby/2.3.0/gems/google-cloud-datastore-0.20.1/lib/google/cloud/datastore/dataset.rb:339:in `find_all'
/app/vendor/bundle/ruby/2.3.0/gems/google-cloud-datastore-0.20.1/lib/google/cloud/datastore/dataset.rb:308:in `find'
Oh, just before that stacktrace, I see what must be native code prints:
E0912 21:34:16.019106760 130 plugin_credentials.c:67] Getting metadata from plugin failed with error: Faraday::ConnectionFailed
E0912 21:34:16.019138486 130 client_auth_filter.c:93] Client side authentication failure: Faraday::ConnectionFailed
Also we have (much smaller) number of unwrapped Faraday::ConnectionFailed exceptions with Net::OpenTimeout cause. Stack trace in that case is more useful:
/app/vendor/bundle/ruby/2.3.0/gems/faraday-0.9.2/lib/faraday/adapter/net_http.rb:82:in `perform_request'
/app/vendor/bundle/ruby/2.3.0/gems/faraday-0.9.2/lib/faraday/adapter/net_http.rb:40:in `block in call'
/app/vendor/bundle/ruby/2.3.0/gems/faraday-0.9.2/lib/faraday/adapter/net_http.rb:87:in `with_net_http_connection'
/app/vendor/bundle/ruby/2.3.0/gems/faraday-0.9.2/lib/faraday/adapter/net_http.rb:32:in `call'
/app/vendor/bundle/ruby/2.3.0/gems/faraday-0.9.2/lib/faraday/request/url_encoded.rb:15:in `call'
/app/vendor/bundle/ruby/2.3.0/gems/faraday-0.9.2/lib/faraday/rack_builder.rb:139:in `build_response'
/app/vendor/bundle/ruby/2.3.0/gems/faraday-0.9.2/lib/faraday/connection.rb:377:in `run_request'
/app/vendor/bundle/ruby/2.3.0/gems/faraday-0.9.2/lib/faraday/connection.rb:177:in `post'
/app/vendor/bundle/ruby/2.3.0/gems/signet-0.7.3/lib/signet/oauth_2/client.rb:960:in `fetch_access_token'
/app/vendor/bundle/ruby/2.3.0/gems/signet-0.7.3/lib/signet/oauth_2/client.rb:998:in `fetch_access_token!'
/app/vendor/bundle/ruby/2.3.0/gems/googleauth-0.5.1/lib/googleauth/signet.rb:69:in `fetch_access_token!'
/app/vendor/bundle/ruby/2.3.0/gems/google-cloud-core-0.20.1/lib/google/cloud/credentials.rb:59:in `initialize'
/app/vendor/bundle/ruby/2.3.0/gems/google-cloud-core-0.20.1/lib/google/cloud/credentials.rb:77:in `new'
/app/vendor/bundle/ruby/2.3.0/gems/google-cloud-core-0.20.1/lib/google/cloud/credentials.rb:77:in `block in default'
/app/vendor/bundle/ruby/2.3.0/gems/google-cloud-core-0.20.1/lib/google/cloud/credentials.rb:76:in `each'
/app/vendor/bundle/ruby/2.3.0/gems/google-cloud-core-0.20.1/lib/google/cloud/credentials.rb:76:in `default'
/app/vendor/bundle/ruby/2.3.0/gems/google-cloud-datastore-0.20.1/lib/google-cloud-datastore.rb:130:in `datastore'
So this is my thinking about what is going on. The Signet object is trying to fetch a new access_token, but during this process a Faraday::ConnectionFailed error is raised. GRPC handles the Faraday::ConnectionFailed error, and since it doesn't have a valid access_token GRPC raises GRPC::BadStatus with code 16, because the request has no valid authentication. Google Cloud handles the GRPC::BadStatus, and raises Google::Cloud::UnauthenticatedError because that is the proper way to convey error code 16.
The problem encountered while making the API call is that there is no valid access_token to use. The cause of this is something to do with Signet and Faraday. Since you have other errors due to calling fetch_access_token!, my guess is that the root cause is something with the networking and most likely orthogonal to GRPC or Google Cloud. If you solve that problem you will solve this behavior.
I agree the error bubbled up to the application layer is somewhat misleading, but its not technically wrong. If you step into each error's cause you will get to the root problem.
Exceptions in auth code should be wrapped into proper Google::Cloud::Exception, so the retry/error handling code is the same. As a user I don't really care at which API endpoint did call fail - auth or the service. And network errors should surface in uniform way which is Unavailable now, it should not be mixed with real authentication problems.
I looked into this yesterday and last night. I was originally intending to contribute a retry on the token refresh, either by configuring Faraday or putting the retry logic somewhere in Google Auth or Signet. But the problem with that is that the call to refresh an access token is non idempotent, and not something that should be retried when failed. I'm not sure what the right approach is here, so I'd like to hear from some Googlers on google/google-auth-library-ruby#89 before making any changes.
That all said, I still don't understand why Faraday is failing in your production environment vs. your staging environment. My guess is some difference between the two environments is the cause. But I did find an interesting issue about Faraday not retrying on a connection timeout when using the net/http adapter (which is the default adapter). The backtrace you shared shows it is using net/http, so it might be worth it to try setting a different default adapter and seeing if the behavior is the same.
Faraday.default_adapter = :httpclient
@blowmage We have not seen this error since then. My theory is that it may be a form of ingress DoS protection in Google auth service, or egress in Heroku one. It happend exactly one hour after first production deployment of new package using new API, before we were using v1beta2. Once it adapted to (relatively) high traffic from our service it allows connections through. We didn't observe any other networking problems (including google datastore) at the time auth connections were failing.
I'm very glad that this error hasn't been persistent this whole time. We very much look forward to more feedback from you on the new package and new API.
What do you want us to do with this issue? I'm unsure about what changes we can make in google-cloud-ruby to account for this.
I think it would make sense to wrap Faraday::ConnectionFailed into UnavailableError (as well as potentially other Faraday exceptions such as TimeOut into corresponding Google::Cloud::Error classes. May be with additional property (e.g. auth_error?) set.
In general I would separate two concepts: error manifestation and retry logic. Retry logic can be internal or external, but errors popping up should all be consistent to make it work.
@blowmage and I have discussed this, and do not feel that google-cloud-datastore should wrap this exception, since it is unexpected and not related to the behavior of the Datastore service. @timanovsky Is it OK with you if we close this issue?
There is some movement on this. The PR google/google-auth-library-ruby#97 implements retry and incremental backoff on the authentication calls.
@danoscarmike I think we are okay to close this issue since, as @quartzmo points out, any changes to address this will be made to libraries other than google-cloud-* gems. What do you think?
PR google/google-auth-library-ruby#97 has been merged and released in googleauth v0.5.2.
Most helpful comment
There is some movement on this. The PR google/google-auth-library-ruby#97 implements retry and incremental backoff on the authentication calls.