Using the current codebase from master branch (e1fbb6b), with GRPC, we sometimes (0.5% of requests, approximately) see the following exception:
AbortionError(code=StatusCode.UNAVAILABLE, details="{"created":"@1478255129.468798425","description":"Secure read failed","file":"src/core/lib/security/transport/secure_endpoint.c","file_line":157,"grpc_status":14,"referenced_errors":[{"created":"@1478255129.468756939","description":"EOF","file":"src/core/lib/iomgr/tcp_posix.c","file_line":235}]}"))
Retrying this seems to always succeed.
Should application code have to care about this kind of error and retry? Or is this a bug in google-cloud-pubsub code?
Package versions installed:
gapic-google-logging-v2==0.10.1
gapic-google-pubsub-v1==0.10.1
google-api-python-client==1.5.4
google-cloud==0.20.0
google-cloud-bigquery==0.20.0
google-cloud-bigtable==0.20.0
google-cloud-core==0.20.0
google-cloud-datastore==0.20.1
google-cloud-dns==0.20.0
google-cloud-error-reporting==0.20.0
google-cloud-language==0.20.0
google-cloud-logging==0.20.0
google-cloud-monitoring==0.20.0
google-cloud-pubsub==0.20.0
google-cloud-resource-manager==0.20.0
google-cloud-storage==0.20.0
google-cloud-translate==0.20.0
google-cloud-vision==0.20.0
google-gax==0.14.1
googleapis-common-protos==1.3.5
grpc-google-iam-v1==0.10.1
grpc-google-logging-v2==0.10.1
grpc-google-pubsub-v1==0.10.1
grpcio==1.0.0
Note: Everything google-cloud* comes from git master.
This is on Python 2.7.3
Traceback:
File "ospdatasubmit/pubsub.py", line 308, in _flush
publish_response = self.pubsub_client.Publish(publish_request, self._publish_timeout)
File "grpc/beta/_client_adaptations.py", line 305, in __call__
self._request_serializer, self._response_deserializer)
File "grpc/beta/_client_adaptations.py", line 203, in _blocking_unary_unary
raise _abortion_error(rpc_error_call)
@forsberg As you can see from the stack trace, this comes from grpc.beta (the beta interface). We haven't used the beta interface for some time. How are you installing the library?
@nathanielmanistaatgoogle I can consistently reproduce an Unavailable error when a connection goes stale. Is there any way to avoid this, short of retrying on failures?
That library installation seems to have gone horribly wrong. We had an earlier version that used the grpc.beta interface, and I guess something went wrong when installing this one into the same virtualenv. Will investigate that on Monday.
@forsberg grpc 1.0 still supports the beta interface, but none of google-cloud-python (or its dependencies) use that interface any longer. So it'd be your google-cloud-python install that's b0rked rather than your grpc install.
@nathanielmanistaatgoogle See #2693 and #2699. What is the recommended way to deal with this for stale connections?
Small update: We have fixed our borked google-cloud-python install so that it actually uses e1fbb6bc, but we're still seeing roughly the same number of UNAVAILABLE errors - retrying always works on the first attempt.
@nathanielmanistaatgoogle Bump
@nathanielmanistaatgoogle Bump
/cc @geigerj This is the issue I was referring to about GAPIC retry strategies
@dhermes You can configure this on the GAPIC layer, see comment here for details. It actually looks like we already retry by default on UNAVAILABLE for Pub/Sub Publish, but you can override the default settings to extend the timeout on retry if that's the issue?
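For reference, a rough sketch of what overriding those settings could look like when calling the generated GAPIC publisher client directly (this assumes the google-gax 0.15.x surface; the backoff numbers, the `publisher_api` variable, and `topic_path`/`messages` are illustrative, not prescribed defaults):

```python
from google.gax import BackoffSettings, CallOptions, RetryOptions
from grpc import StatusCode

# Illustrative backoff: keep retrying UNAVAILABLE for up to 10 minutes total.
backoff = BackoffSettings(
    initial_retry_delay_millis=100,
    retry_delay_multiplier=1.3,
    max_retry_delay_millis=60000,
    initial_rpc_timeout_millis=60000,
    rpc_timeout_multiplier=1.0,
    max_rpc_timeout_millis=60000,
    total_timeout_millis=600000,
)
retry = RetryOptions(
    retry_codes=[StatusCode.UNAVAILABLE, StatusCode.DEADLINE_EXCEEDED],
    backoff_settings=backoff,
)

# publisher_api is assumed to be an instance of the generated
# google.cloud.gapic.pubsub.v1.publisher_client.PublisherClient.
response = publisher_api.publish(topic_path, messages, options=CallOptions(retry=retry))
```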
@geigerj I believe the correct link to "retry by default on UNAVAILABLE for Pub/Sub Publish" is this one, because the one you've provided no longer works.
The bug affects us too. When the connection is stale, we get exactly the same error as here (a grpc._channel._Rendezvous, but caused by a Pub/Sub publish). Once we retry, it works.
We're using:
gapic-google-pubsub-v1==0.11.1
google-cloud-core==0.21.0
google-cloud-pubsub==0.21.0
google-gax==0.15.0
googleapis-common-protos==1.5.0
grpc-google-iam-v1==0.11.1
grpc-google-pubsub-v1==0.11.1
What seems strange is that the default retry you mentioned doesn't seem to work, and I have checked that the file publisher_client_config.json is present with correct values. I get the unhandled exception much sooner than 60s, almost immediately (I haven't measured it precisely).
Updated:
It seems that I was wrong about the dependency versions, but publisher_client_config.json is the same. I don't think it will change anything, but I will switch to the newest versions and report back.
Actual versions:
gapic-google-pubsub-v1==0.10.1
google-cloud-core==0.21.0
google-cloud-pubsub==0.21.0
google-gax==0.14.1
googleapis-common-protos==1.5.0
grpc-google-iam-v1==0.10.1
grpc-google-pubsub-v1==0.10.1
grpcio==1.0.1
Updated 2:
Newest versions do not fix this issue.
I'm still getting
GaxError(RPC failed, caused by <_Rendezvous of RPC that terminated with (StatusCode.UNAVAILABLE, {"created":"@1480777925.720435842","description":"Secure read failed","file":"src/core/lib/security/transport/secure_endpoint.c","file_line":157,"grpc_status":14,"referenced_errors":[{"created":"@1480777925.720399286","description":"EOF","file":"src/core/lib/iomgr/tcp_posix.c","file_line":235}]})>)
when using a stale connection. Retrying fixes the issue.
Package versions:
gapic-google-cloud-pubsub-v1==0.14.0
google-cloud-core==0.21.0
git+https://github.com/GoogleCloudPlatform/google-cloud-python.git@a02ac500548cf9fc37f4d81033e696a0efb53f99#egg=google-cloud-pubsub&subdirectory=pubsub
google-gax==0.15.0
googleapis-common-protos==1.5.0
grpc-google-cloud-pubsub-v1==0.14.0
grpc-google-iam-v1==0.11.1
grpcio==1.0.1
(google-cloud-pubsub installed from Git master)
Just wanted to add to the above that disabling gRPC "fixed" the issue (`$ export GOOGLE_CLOUD_DISABLE_GRPC=true`) - there's no need for manual retrying and there are no errors with stale connections.
@dhermes: does that code of yours hit a particular host? If so, are you able to reproduce the problem against any other host? If you're able to hit that host with unauthenticated RPCs, are you able to reproduce the defect in the absence of authentication? If you are able to observe the traffic at a low level (with Wireshark or something like it), is there anything obviously the matter? Obviously the expected behavior is that if you hold a grpc.Channel and don't use it for a matter of minutes, it should still be able to make RPCs. They may take slightly longer if the underlying TCP connection has been taken down, but they shouldn't fail and then immediately succeed when reattempted.
@dhermes: when I run this code of yours I get StatusCode.PERMISSION_DENIED, so there's more to the reproduction of the problem than merely running that, right? Something has to happen server-side? Possibly something else has to happen locally?
I think I'm seeing a similar issue.
My goal is to have a connection to speech.googleapis.com always open so that whenever a user wants to say something, they can enter a 'y' through the terminal and then speak instantly. Otherwise, establishing the connection seems to take about 4 seconds on our architecture.
However, it seems that the connection closes after a while. Would this issue be the cause?
I have taken google's streaming python example code and modified it for my purposes.
def main():
    # Open channel to Google Speech and keep it open indefinitely.
    with cloud_speech.beta_create_Speech_stub(
            make_channel('speech.googleapis.com', 443)) as service:
        answer = ""
        while True:
            # If we're not retrying from a failed attempt,
            # wait for the user to send 'y' to start recording
            if answer != "retry":
                answer = raw_input("Do you want to record? y/n: ")
            # pass through raw_input block
            # in an attempt to retry the streaming
            # request.
            else:
                answer = "y"
            if answer == "y":
                print("Received the Y")
                # For streaming audio from the microphone, there are three threads.
                # First, a thread that collects audio data as it comes in
                with record_audio(RATE, CHUNK) as buffered_audio_data:
                    # Second, a thread that sends requests with that data
                    requests = request_stream(buffered_audio_data, RATE)
                    # Third, a thread that listens for transcription responses
                    recognize_stream = service.StreamingRecognize(
                        requests, DEADLINE_SECS)
                    try:
                        listen_print_loop(recognize_stream)
                        recognize_stream.cancel()
                    except face.CancellationError:
                        pass
                    except face.AbortionError:
                        print("ABORTION ERROR RECEIVED")
                        answer = "retry"
@dakrawczyk I think you might be running into the 1 minute limit for streaming.
See: https://cloud.google.com/speech/limits#content
You said...
establishing the connection seems to take about 4 seconds on our architecture.
Do you know if that connection overhead is on the google-cloud-python side or has something to do with your architecture?
@daspecster I don't think it's the 1 minute limit for streaming - I know what you're talking about, but I'm not actually streaming until the user enters 'y' and the record/request/response streams are started and used. My understanding is that I'm only creating the channel to begin with, and that doesn't count against the 1 minute timeout. Also, I only end up streaming for about 10 seconds at a time.
I am building an embedded system using a Samsung ARTIK 710 running Debian.
When I run
with cloud_speech.beta_create_Speech_stub(
        make_channel('speech.googleapis.com', 443)) as service:
on my MacBook it is basically instant.
When it runs on my embedded architecture it takes about 4 seconds.
Ok, good to know!
I just realized that your code is actually not using this library. You're using the gRPC library directly.
I don't know that this is the best issue to discuss this in, as it might get kind of drawn out.
If you want you can ping me on https://googlecloud-community.slack.com. I've spent some time in Speech so I might be able to help get you going.
@dakrawczyk Hello! Looks like you're using the sample that I wrote, so I'm here to take blame / responsibility ^_^;
From the symptoms you describe (error happens in time span > streaming limits, re-starting the stream a lot, some auth thing mentioned above), my guess is that the access token is expiring.
The make_channel function grabs an access token the first time it's run (i.e. when it creates the channel), but doesn't refresh it, so if you keep the channel open long enough, the access token's validity period expires. I imagine each time you start a new stream, grpc re-sends the auth headers, so eventually you'll start getting auth errors.
If that's the case, you might be able to fix this by modifying make_channel so that it calls get_access_token() on the fly, instead of just getting it upon channel creation.
Let me know how that works. I haven't looked at the google-cloud-python code, but perhaps it's a similar issue?
Also - the sample has since been updated to use the google-auth package, which should also fix that issue.
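For anyone landing here, a minimal sketch of a make_channel along those lines, built on the google-auth package so the access token is refreshed on the fly rather than captured once at channel creation (the scope constant and function body are illustrative, not the sample's exact code):

```python
import google.auth
import google.auth.transport.grpc
import google.auth.transport.requests

# Illustrative scope; adjust to whatever your application actually needs.
SPEECH_SCOPE = 'https://www.googleapis.com/auth/cloud-platform'


def make_channel(host, port):
    """Create a secure channel whose credentials are refreshed automatically."""
    credentials, _ = google.auth.default(scopes=[SPEECH_SCOPE])
    http_request = google.auth.transport.requests.Request()
    target = '{}:{}'.format(host, port)
    # secure_authorized_channel attaches call credentials that google-auth
    # refreshes when the current token expires, instead of a one-shot token.
    return google.auth.transport.grpc.secure_authorized_channel(
        credentials, http_request, target)
```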
@jerjou Thank you! Trying this out now :]
@jerjou I've updated to the newer sample code that uses the google-auth package.
I am still under the impression that the channel closes after a certain amount of time, probably due to the validity expiring.
Here is the error I receive when trying to send/receive data over the channel after some time, once the channel has closed.
Traceback (most recent call last):
File "transcribe_streaming.py", line 323, in <module>
main()
File "transcribe_streaming.py", line 310, in main
listen_print_loop(recognize_stream)
File "transcribe_streaming.py", line 239, in listen_print_loop
for resp in recognize_stream:
File "/usr/local/lib/python2.7/site-packages/grpc/_channel.py", line 344, in next
return self._next()
File "/usr/local/lib/python2.7/site-packages/grpc/_channel.py", line 335, in _next
raise self
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.UNAVAILABLE, {"created":"@1482091441.761674000","description":"Secure read failed","file":"src/core/lib/security/transport/secure_endpoint.c","file_line":157,"grpc_status":14,"referenced_errors":[{"created":"@1482091441.761631000","description":"EOF","file":"src/core/lib/iomgr/tcp_posix.c","file_line":235}]})>
A couple of questions:
I can't just reopen the channel when the user wants to record, because that adds a delay of a few seconds, and that's not the experience we're going for.
How do I join https://googlecloud-community.slack.com? Is it a closed group?
@jerjou thanks. I just joined. What is the channel you are in? I could not find a cloud speech channel there. In fact there is one, but it is inactive.
Hey guys, I've read this thread and still haven't come up with a solution.
When trying
GOOGLE_CLOUD_DISABLE_GRPC=true
I'm getting a connection timeout because it's trying to connect to localhost:8499.
What am I missing?
@ohadperry, some of the emulators run on 84xx ports. I'm not sure what your system configuration is though.
@daspecster hi. I understand that it's supposed to connect to an emulator.
My question is how to get it to connect to my project's production Pub/Sub.
I don't care whether it's RPC or HTTPS.
By default, it shouldn't try to connect to the emulator. My guess is that there's probably something in your configuration/environment that's redirecting it.
You could check if the PUBSUB_EMULATOR_HOST environment variable is set. If so then you'll want to unset it.
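If it's easier to check from Python, an equivalent sketch using only the standard library:

```python
import os

# Should print None when no emulator override is configured.
print(os.environ.get('PUBSUB_EMULATOR_HOST'))

# Same effect as `unset PUBSUB_EMULATOR_HOST`, for the current process only.
os.environ.pop('PUBSUB_EMULATOR_HOST', None)
```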
@daspecster thanks, unsetting PUBSUB_EMULATOR_HOST in my environment variables worked!!
Any comments as to why the RPC transport wouldn't work? RPC should increase the push/pull rate, shouldn't it? If yes, then I'm really interested in why it's not working for me.
gRPC may increase performance depending on your application.
google-cloud defaults to using gRPC. So unless there was an error during installation, I would guess that it's using PubSub over gRPC already.
```
$ pip freeze
grpc-google-cloud-pubsub-v1==0.14.0
grpcio==1.0.4
```
If you have those two libraries installed then I think you're probably all set.
You can check after you instantiate your `Client` with something like...
```python
client = pubsub.Client()
print(client._use_gax)
```
If `_use_gax` is `True` then the library is using gRPC.
@daspecster, yes I know. I meant I'm getting gRPC StatusCode.UNAVAILABLE errors when using gRPC. Here are my pips:
grpc-google-iam-v1==0.10.1
grpc-google-pubsub-v1==0.10.1
grpcio==1.0.1
Those are fairly old. You might want to try updating them, but I can't say that updating will solve the UNAVAILABLE issue. Mostly I think you'll want to code in retries on that event.
The exception thrown is fairly internal. Could the cloud-pubsub library catch the error and raise something more descriptive? Also, can the retries be documented?
Upgraded my stack to google-cloud-pubsub==0.22.0. Error is still present, traceback/error message is slightly different. Here's a fresh one:
ERROR 2017-02-22 08:16:31,484
Traceback (most recent call last):
File "/opt/ospdatasubmit/virtualenvs/ospsubmit.opera.com/v1/local/lib/python2.7/site-packages/ospgcptools/pubsub/__init__.py", line 327, in flush
self.pubsub_topic.publish(data)
File "/opt/ospdatasubmit/virtualenvs/ospsubmit.opera.com/v1/local/lib/python2.7/site-packages/google/cloud/pubsub/topic.py", line 253, in publish
message_ids = api.topic_publish(self.full_name, [message_data])
File "/opt/ospdatasubmit/virtualenvs/ospsubmit.opera.com/v1/local/lib/python2.7/site-packages/google/cloud/pubsub/_gax.py", line 173, in topic_publish
options=options)
File "/opt/ospdatasubmit/virtualenvs/ospsubmit.opera.com/v1/local/lib/python2.7/site-packages/google/cloud/gapic/pubsub/v1/publisher_client.py", line 290, in publish
return self._publish(request, options)
File "/opt/ospdatasubmit/virtualenvs/ospsubmit.opera.com/v1/local/lib/python2.7/site-packages/google/gax/api_callable.py", line 442, in inner
return api_caller(api_call, this_settings, request)
File "/opt/ospdatasubmit/virtualenvs/ospsubmit.opera.com/v1/local/lib/python2.7/site-packages/google/gax/api_callable.py", line 70, in inner
return a_func(request, **kwargs)
File "/opt/ospdatasubmit/virtualenvs/ospsubmit.opera.com/v1/local/lib/python2.7/site-packages/google/gax/api_callable.py", line 395, in inner
gax.errors.create_error('RPC failed', cause=exception))
File "/opt/ospdatasubmit/virtualenvs/ospsubmit.opera.com/v1/local/lib/python2.7/site-packages/google/gax/api_callable.py", line 391, in inner
return a_func(*args, **kwargs)
File "/opt/ospdatasubmit/virtualenvs/ospsubmit.opera.com/v1/local/lib/python2.7/site-packages/google/gax/retry.py", line 67, in inner
return a_func(*updated_args, **kwargs)
File "/opt/ospdatasubmit/virtualenvs/ospsubmit.opera.com/v1/local/lib/python2.7/site-packages/grpc/_channel.py", line 511, in __call__
return _end_unary_response_blocking(state, False, deadline)
File "/opt/ospdatasubmit/virtualenvs/ospsubmit.opera.com/v1/local/lib/python2.7/site-packages/grpc/_channel.py", line 459, in _end_unary_response_blocking
raise _Rendezvous(state, None, None, deadline)
GaxError: GaxError(RPC failed, caused by <_Rendezvous of RPC that terminated with (StatusCode.UNAVAILABLE, {"created":"@1487751391.483882744","description":"Endpoint read failed","file":"src/core/ext/transport/chttp2/transport/chttp2_transport.c","file_line":1851,"grpc_status":14,"occurred_during_write":0,"referenced_errors":[{"created":"@1487751391.483832140","description":"Secure read failed","file":"src/core/lib/security/transport/secure_endpoint.c","file_line":166,"referenced_errors":[{"created":"@1487751391.483828339","description":"Socket closed","fd":67,"file":"src/core/lib/iomgr/tcp_posix.c","file_line":249,"target_address":"ipv6:[2a00:1450:400e:807::200a]:443"}]}]})>)
Some package versions:
pip freeze|egrep 'grpc|pubsub|google-cloud-core|grep protobuf'
gapic-google-cloud-pubsub-v1==0.14.1
google-cloud-core==0.22.1
google-cloud-pubsub==0.22.0
grpc-google-cloud-pubsub-v1==0.14.0
grpc-google-iam-v1==0.11.1
grpc-google-pubsub-v1==0.10.1
grpcio==1.1.0
Timestamp in UTC if some googler wants to look on the other side. Let me know if there's something I can add to my logs to aid in debugging.
In most cases, an immediate retry will fix the problem. Sometimes we have to retry 2 or 3 times (we give up after 3 times and drop the message).
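For what it's worth, the retry we do in application code looks roughly like this (a sketch only; the helper name and attempt count are ours, not part of google-cloud-pubsub, and it assumes the GAX-based `topic.publish` surface that wraps the underlying RPC error in a GaxError):

```python
from google.gax.errors import GaxError
from grpc import StatusCode


def publish_with_retry(topic, data, attempts=3):
    """Publish data, retrying immediately on UNAVAILABLE up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return topic.publish(data)
        except GaxError as exc:
            cause = getattr(exc, 'cause', None)
            code = cause.code() if cause is not None else None
            if code != StatusCode.UNAVAILABLE or attempt == attempts:
                raise
            # Stale connection: an immediate retry almost always succeeds.
```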
Also seeing this issue. Retrying within our own code seems to work around it; we also only retry a max of 3 times. Usually the second try fixes it.
We were on 0.18 and just upped to 0.23.
We run Python 3.6.
$ pip freeze|egrep 'grpc|pubsub|google-cloud-core|grep protobuf'
gapic-google-cloud-pubsub-v1==0.15.0
google-cloud-core==0.23.1
google-cloud-pubsub==0.23.0
grpc-google-cloud-logging-v2==0.90.0
grpc-google-iam-v1==0.11.1
grpcio==1.1.3
proto-google-cloud-pubsub-v1==0.15.1
Traceback (most recent call last):
File "/opt/app/psutils.py", line 129, in _pub_topic
return topic.publish(bytes(json.dumps(data), 'utf-8'))
File "/home/pythonapp/.pyenv/versions/3.6.0/envs/venv/lib/python3.6/site-packages/google/cloud/pubsub/topic.py", line 255, in publish
message_ids = api.topic_publish(self.full_name, [message_data])
File "/home/pythonapp/.pyenv/versions/3.6.0/envs/venv/lib/python3.6/site-packages/google/cloud/pubsub/_gax.py", line 174, in topic_publish
options=options)
File "/home/pythonapp/.pyenv/versions/3.6.0/envs/venv/lib/python3.6/site-packages/google/cloud/gapic/pubsub/v1/publisher_client.py", line 320, in publish
return self._publish(request, options)
File "/home/pythonapp/.pyenv/versions/3.6.0/envs/venv/lib/python3.6/site-packages/google/gax/api_callable.py", line 419, in inner
return api_caller(api_call, this_settings, request)
File "/home/pythonapp/.pyenv/versions/3.6.0/envs/venv/lib/python3.6/site-packages/google/gax/api_callable.py", line 67, in inner
return a_func(request, **kwargs)
File "/home/pythonapp/.pyenv/versions/3.6.0/envs/venv/lib/python3.6/site-packages/google/gax/api_callable.py", line 372, in inner
gax.errors.create_error('RPC failed', cause=exception))
File "/home/pythonapp/.pyenv/versions/3.6.0/envs/venv/lib/python3.6/site-packages/future/utils/__init__.py", line 419, in raise_with_traceback
raise exc.with_traceback(traceback)
File "/home/pythonapp/.pyenv/versions/3.6.0/envs/venv/lib/python3.6/site-packages/google/gax/api_callable.py", line 368, in inner
return a_func(*args, **kwargs)
File "/home/pythonapp/.pyenv/versions/3.6.0/envs/venv/lib/python3.6/site-packages/google/gax/retry.py", line 68, in inner
return a_func(*updated_args, **kwargs)
File "/home/pythonapp/.pyenv/versions/3.6.0/envs/venv/lib/python3.6/site-packages/grpc/_channel.py", line 507, in __call__
return _end_unary_response_blocking(state, False, deadline)
File "/home/pythonapp/.pyenv/versions/3.6.0/envs/venv/lib/python3.6/site-packages/grpc/_channel.py", line 455, in _end_unary_response_blocking
raise _Rendezvous(state, None, None, deadline)
google.gax.errors.GaxError: GaxError(RPC failed, caused by <_Rendezvous of RPC that terminated with (StatusCode.UNAVAILABLE, {"created":"@1488959268.808794184","description":"Endpoint read failed","file":"src/core/ext/transport/chttp2/transport/chttp2_transport.c","file_line":1851,"grpc_status":14,"occurred_during_write":0,"referenced_errors":[{"created":"@1488959268.808696874","description":"Secure read failed","file":"src/core/lib/security/transport/secure_endpoint.c","file_line":166,"referenced_errors":[{"created":"@1488959268.808693234","description":"Socket closed","fd":55,"file":"src/core/lib/iomgr/tcp_posix.c","file_line":249,"target_address":"ipv4:173.194.74.95:443"}]}]})>)
I really think my problem is related to this. We have a Node.js client connecting to a Python server using gRPC, and we frequently receive this:
Critical: gRPC server raised an error.
Error: {"created":"@1489090385.752311821","description":"Endpoint read failed","file":"../src/core/ext/transport/chttp2/transport/chttp2_transport.c","file_line":1851,"grpc_status":14,"occurred_during_write":0,"referenced_errors":[{"created":"@1489090385.752305292","description":"TCP Read failed","file":"../src/core/lib/iomgr/tcp_uv.c","file_line":170}]} { Error: {"created":"@1489090385.752311821","description":"Endpoint read failed","file":"../src/core/ext/transport/chttp2/transport/chttp2_transport.c","file_line":1851,"grpc_status":14,"occurred_during_write":0,"referenced_errors":[{"created":"@1489090385.752305292","description":"TCP Read failed","file":"../src/core/lib/iomgr/tcp_uv.c","file_line":170}]}
at /app/node_modules/grpc/src/node/src/client.js:434:17
cause:
{ Error: {"created":"@1489090385.752311821","description":"Endpoint read failed","file":"../src/core/ext/transport/chttp2/transport/chttp2_transport.c","file_line":1851,"grpc_status":14,"occurred_during_write":0,"referenced_errors":[{"created":"@1489090385.752305292","description":"TCP Read failed","file":"../src/core/lib/iomgr/tcp_uv.c","file_line":170}]}
at /app/node_modules/grpc/src/node/src/client.js:434:17 code: 14, metadata: Metadata { _internal_repr: {} } },
isOperational: true,
code: 14,
metadata: Metadata { _internal_repr: {} } }
Sometimes, the same request on the same server works without any problem.
@barroca This looks like a Node.js failure?
Might be; we are creating a client in another language to isolate it.
I just needed to share my frustration and see if someone had the same problem. Anyway, it's very strange, since it's an intermittent error with no pattern: it sometimes happens within seconds, sometimes within minutes. :(
@barroca I had the same problem. In my case, if my Node.js process runs for a while without any requests, this error occurs, and making the request again gets a normal response.
Really need help
{ Error: {"created":"@1490271131.819044969","description":"Endpoint read failed","file":"../src/core/ext/transport/chttp2/transport/chttp2_transport.c","file_line":1851,"grpc_status":14,"occurred_during_write":0,"referenced_errors":[{"created":"@1490271131.819031343","description":"Socket closed","fd":16,"file":"../src/core/lib/iomgr/tcp_posix.c","file_line":249,"target_address":"ipv4:172.16.250.137:8980"}]}
2017-03-23T12:12:11.842630309Z at /usr/local/wongnai/node_modules/grpc/src/node/src/client.js:434:17
2017-03-23T12:12:11.842638153Z cause:
2017-03-23T12:12:11.842643446Z { Error: {"created":"@1490271131.819044969","description":"Endpoint read failed","file":"../src/core/ext/transport/chttp2/transport/chttp2_transport.c","file_line":1851,"grpc_status":14,"occurred_during_write":0,"referenced_errors":[{"created":"@1490271131.819031343","description":"Socket closed","fd":16,"file":"../src/core/lib/iomgr/tcp_posix.c","file_line":249,"target_address":"ipv4:172.16.250.137:8980"}]}
2017-03-23T12:12:11.842652712Z at /usr/local/wongnai/node_modules/grpc/src/node/src/client.js:434:17 code: 14, metadata: Metadata { _internal_repr: {} } },
isOperational: true,
code: 14,
metadata: Metadata { _internal_repr: {} } }
I confirm the intermittent errors when working with the Bigtable API.

python 3.5.2
google-cloud==0.23.0
(storm) ➜ storm git:(develop) ✗ vi requirements/base.txt
(storm) ➜ storm git:(develop) ✗ pip freeze G google
21:gapic-google-cloud-datastore-v1==0.15.3
22:gapic-google-cloud-error-reporting-v1beta1==0.15.3
23:gapic-google-cloud-logging-v2==0.91.3
24:gapic-google-cloud-pubsub-v1==0.15.3
25:gapic-google-cloud-spanner-admin-database-v1==0.15.3
26:gapic-google-cloud-spanner-admin-instance-v1==0.15.3
27:gapic-google-cloud-spanner-v1==0.15.3
28:gapic-google-cloud-speech-v1beta1==0.15.3
29:gapic-google-cloud-vision-v1==0.90.3
31:google-auth==0.10.0
32:google-auth-httplib2==0.0.2
33:google-cloud==0.23.0
34:google-cloud-bigquery==0.23.0
35:google-cloud-bigtable==0.23.1
36:google-cloud-core==0.23.1
37:google-cloud-datastore==0.23.0
38:google-cloud-dns==0.23.0
39:google-cloud-error-reporting==0.23.2
40:google-cloud-language==0.23.1
41:google-cloud-logging==0.23.1
42:google-cloud-monitoring==0.23.0
43:google-cloud-pubsub==0.23.0
44:google-cloud-resource-manager==0.23.0
45:google-cloud-runtimeconfig==0.23.0
46:google-cloud-spanner==0.23.1
47:google-cloud-speech==0.23.0
48:google-cloud-storage==0.23.1
49:google-cloud-translate==0.23.0
50:google-cloud-vision==0.23.3
51:google-gax==0.15.8
52:googleapis-common-protos==1.5.2
55:grpc-google-iam-v1==0.11.1
80:proto-google-cloud-datastore-v1==0.90.3
81:proto-google-cloud-error-reporting-v1beta1==0.15.3
82:proto-google-cloud-logging-v2==0.91.3
83:proto-google-cloud-pubsub-v1==0.15.3
84:proto-google-cloud-spanner-admin-database-v1==0.15.3
85:proto-google-cloud-spanner-admin-instance-v1==0.15.3
86:proto-google-cloud-spanner-v1==0.15.3
87:proto-google-cloud-speech-v1beta1==0.15.3
88:proto-google-cloud-vision-v1==0.90.3
(storm) ➜ storm git:(develop) ✗ pip freeze G grpc
55:grpc-google-iam-v1==0.11.1
56:grpcio==1.2.1
(storm) ➜ storm git:(develop) ✗
@nathanielmanistaatgoogle Can you weigh in / can we all have a pow-wow about this?
I have the same error with:
Client:
Server:
Client error message:
File "/home/project/venv/lib/python3.6/site-packages/grpc/_channel.py", line 455, in _end_unary_response_blocking
raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.UNAVAILABLE, Endpoint read failed)>
I updated to [email protected] and I'm waiting to see if the error occurs again.
Notes: I'm not sure if "Endpoint read failed" is the same as "Secure read failed". I think I had both errors from time to time.
@dhermes: my apologies for the silence; this is now properly being recognized as a problem in gRPC Core affecting all Core-using languages and will be fixed in Core. In the meantime I don't know that the options are any good: if your RPC is idempotent you can make application-level retries until it succeeds, but if it isn't idempotent you may have to awkwardly work around the issue in a problem-specific way.
So... yay that it's now properly being recognized as a defect. Boo to the rest. Yes to a sync if you'd still like one.
Thanks @nathanielmanistaatgoogle, is there a tracking issue somewhere?
@dhermes: there is now; please add any details (pretty please a deterministic reproduction?).
@nathanielmanistaatgoogle I already gave a deterministic reproduction. I am happy to chat with you off the thread about how to set up the credentials needed for this or we could work together (I'll need your expertise) to create a gRPC service that doesn't require auth to accomplish the same goal.
@lukesneeringer @dhermes The issue that @nathanielmanistaatgoogle referenced (https://github.com/grpc/grpc/issues/11043) was fixed on June 8. Is this still an issue?
@bjwatson Checking right now
The example still fails:
Listing all instances:
Traceback (most recent call last):
File "bt_unavailable.py", line 30, in <module>
main()
File "bt_unavailable.py", line 26, in main
list_em(client)
File "bt_unavailable.py", line 8, in list_em
instances, failed_locations = client.list_instances()
File ".../google/cloud/bigtable/client.py", line 375, in list_instances
response = self._instance_stub.ListInstances(request_pb)
File ".../grpc/_channel.py", line 507, in __call__
return _end_unary_response_blocking(state, call, False, deadline)
File ".../grpc/_channel.py", line 455, in _end_unary_response_blocking
raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.UNAVAILABLE, Endpoint read failed)>
This fails in Python 2.7 with grpcio==1.4.0
@nathanielmanistaatgoogle, looks like the gRPC fix was insufficient for this issue. Do you have any insight into what else might be going wrong?
FYI @lukesneeringer
Hello! :-)
First, a _mea culpa_; I have not done as good of a job at keeping up with issues as I should have. If you are getting this (I admit it) cut and paste, it is likely because your issue sat for too long.
In this case, I have been in the process of making a radical update to the PubSub library (#3637) to add significant performance improvements and a new surface, which we hope to launch soon. As such, I am clearing out issues on the old library. It is my sincere goal to do a better job of being on top of issues in the future.
As the preceding paragraph implies, I am closing this issue. If the revamped library does not solve your issue, however, please feel free to reopen.
Thanks!
@lukesneeringer from my recollection, this wasn't Pub/Sub-specific?
That's correct, I even reproduced it with bigtable 12 days ago.
Yeah, I was firing through everything with an api: pubsub label. Thanks for reopening.
I am removing all the "api: X" labels from this issue since issue automation is coming. The grpc label is the appropriate tracking.
Although really this should just be moved to grpc.
Reproduced the Bigtable issue with google-cloud-bigtable==0.26.0 during:
def write_row(key, column_id):
    """
    Utility method for writing a row to BigTable.

    Note that we don't actually store values - the column ids are where we actually store values,
    so the value is always just an empty string.

    :param key: Key of the row to write
    :param column_id: This is the actual value we want stored.
    :return: None
    """
    row = db.table.row(key)
    row.set_cell(conf.column_family_id, column_id, '')
    row.commit()
The error looks like:
_Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.UNAVAILABLE, Endpoint read failed)>
at _end_unary_response_blocking (/env/local/lib/python2.7/site-packages/grpc/_channel.py:455)
at __call__ (/env/local/lib/python2.7/site-packages/grpc/_channel.py:507)
at commit (/env/local/lib/python2.7/site-packages/google/cloud/bigtable/row.py:417)
Hi guys,
I'm also facing this issue, but with a Pub/Sub subscription.
We started a subscription for one topic yesterday and everything was working fine.
However, today we saw this exception on the console.
I'm using the following packages:
google-auth==1.1.1
google-cloud-bigquery==0.27.0
google-cloud-core==0.27.1
google-cloud-logging==1.3.0
google-cloud-pubsub==0.28.3
google-cloud-storage==1.4.0
google-gax==0.15.15
google-resumable-media==0.2.3
googleapis-common-protos==1.5.3
grpc-google-iam-v1==0.11.3
proto-google-cloud-logging-v2==0.91.3
and I'm running on a Linux Machine (Ubuntu 16.04.3 LTS)
Exception:
Exception in thread Consumer helper: consume bidirectional stream:
Traceback (most recent call last):
File "/home/user/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/user/anaconda3/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/user/anaconda3/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_consumer.py", line 248, in _blocking_consume
self._policy.on_exception(exc)
File "/home/user/anaconda3/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/policy/thread.py", line 135, in on_exception
raise exception
File "/home/user/anaconda3/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_consumer.py", line 234, in _blocking_consume
for response in response_generator:
File "/home/user/anaconda3/lib/python3.6/site-packages/grpc/_channel.py", line 348, in __next__
return self._next()
File "/home/user/anaconda3/lib/python3.6/site-packages/grpc/_channel.py", line 342, in _next
raise self
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.UNAVAILABLE, OS Error)>
I just got this error while running a job on Google ML Engine.
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 198, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 194, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 296, in train
saver=saver)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 793, in train
train_step_kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 546, in train_step
if sess.run(train_step_kwargs['should_log']):
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 965, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1015, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1035, in _do_call
raise type(e)(node_def, op, message)
UnavailableError: {"created":"@1507348685.699643392","description":"EOF","file":"external/grpc/src/core/lib/iomgr/tcp_posix.c","file_line":235,"grpc_status":14}
Honestly not sure how to solve this other than retrying the job.
Edit:
Rerunning does not seem to help; I keep getting the same error.
Hi, I'm getting this error on a Pub/Sub consumer. I managed to get a "not so pretty" workaround,
using a policy like this that replicates the deadline_exceeded handling in google.cloud.pubsub_v1.subscriber.policy.thread.Policy.on_exception.
from google.cloud.pubsub_v1.subscriber.policy.thread import Policy
import grpc


class UnavailableHackPolicy(Policy):
    def on_exception(self, exception):
        """
        There is an issue on the grpc channel that raises an UNAVAILABLE
        exception now and then. Until that issue is fixed, we need to protect
        our consumer thread from breaking.
        https://github.com/GoogleCloudPlatform/google-cloud-python/issues/2683
        """
        unavailable = grpc.StatusCode.UNAVAILABLE
        if getattr(exception, 'code', lambda: None)() in [unavailable]:
            print("¡OrbitalHack! - {}".format(exception))
            return
        return super(UnavailableHackPolicy, self).on_exception(exception)
In the receive-message function I have code like:
subscriber = pubsub.SubscriberClient(policy_class=UnavailableHackPolicy)
subscription_path = subscriber.subscription_path(project, subscription_name)
subscriber.subscribe(subscription_path, callback=callback, flow_control=flow_control)
The problem is that when the resource is truly UNAVAILABLE we will not be aware of it.
UPDATE: As noted here by @makrusak and here by @rclough, this hack causes high CPU usage, leaving your consumer practically useless (available only intermittently). So basically this trades one problem for another: your consumer does not die, but you will have to restart the worker that executes it often.
I might be getting a similar problem on Spanner when trying to read ranges with an index. I will need to test whether it's my code or not.
I think with all the work that @dhermes did on pubsub this should be resolved. I'm going to go ahead and close this, but if it's still reproducible with the latest version we can re-open.