Google-cloud-java: Pub/Sub Messages stop being Processed

Created on 30 Jan 2017 · 17Comments · Source: googleapis/google-cloud-java

Hi,
Over the weekend I've been experiencing abnormal behaviour with Pub/Sub. Since Friday at 4pm ET I've experienced numerous cases were my services will stop processing messages. I can see in stackdriver that messages are being queued but not processed. Restarting the services allows the messages to be processed until it stops again.

This seems to be a issue with v0.8.x and less with v0.4.0. In v0.8.x there is a infinite timeout (DefaultPubSubRpc.java) and in v0.4.0 there doesn't seem to be (DefaultPubSubRpc.java).

In my experience setting a infinite timeout is rarely a good idea. Is there a reason why you do this?

pubsub bug

Source

GEverding

Most helpful comment

We expect to release it next week (sometime Feb 7-Feb 10).

garrettjonesgoogle on 3 Feb 2017

🎉3

All 17 comments

Possibly seeing the same problem, were doing subscription.pullAsync(..) and sometimes messages just stop being processed and build up until the consumer is restarted. Considering moving to pull() and managing threading / timeouts ourselves.

nhoughto on 3 Feb 2017

I filed a ticket with cloud support. The response I got after some back and forth was it's is a issue with there grpc load balancers for pub sub. They've been slowly rolling out a fix since yesterday.

Garrett

On Thu, Feb 2, 2017 at 6:21 PM -0500, "nhoughto" notifications@github.com wrote:

—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub, or mute the thread.

GEverding on 3 Feb 2017

Good to know, thanks.

Interesting that the MessageConsumer obj doesn't expose an isAlive() or something similar to detect failure, so the assumption is that it manages it for you. Presume even if gRPC LBs are misbehaving the gRPC client should handle the failure mode itself.

nhoughto on 3 Feb 2017

Setting timeout to infinity does indeed sound like a problem. The motivation of making the infinite timeout seems to be https://github.com/GoogleCloudPlatform/google-cloud-java/issues/1157 .

We are currently working on a higher-performance PubSub client, currently in the pubsub-hp branch. In this new implementation, the timeout is properly set.

The difference is that now the RPC will return immediately if there's no message available instead of waiting. It will exponentially back off though (max 10s). You might be making more RPCs with the new client, though I don't think the difference will be big enough to be a problem. Once streaming pull becomes available, the difference should become moot.

@GEverding @nhoughto Do you think this will help your use case?

pongad on 3 Feb 2017

Yep that sounds good. Happy with immediate return with a backoff (seems like a good tradeoff).

Any idea on timelines for landing pubsub-hp or streaming pull changes? We are going to halt our pubsub roll out to new components until we sort this out.

nhoughto on 3 Feb 2017

That seems like a much better idea. Is there a eta for when that branch will be ready?

GEverding on 3 Feb 2017

We expect to release it next week (sometime Feb 7-Feb 10).

garrettjonesgoogle on 3 Feb 2017

🎉3

I'm still encountering this issue. Bye changing the way I poll (calling pullAsync but with a limit and retrying automatically), I finally got this exception (pasted below)
Don't know if it's connected to the current problem though.

java.util.concurrent.ExecutionException: com.google.cloud.pubsub.PubSubException: io.grpc.StatusRuntimeException: UNAVAILABLE: HTTP/2 error code: NO_ERROR
17:15:04.000
Received Goaway
17:15:04.000
max_age
17:15:04.000
at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:500)
17:15:04.000
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:479)
17:15:04.000
at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:76)
17:15:04.000
at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:62)
17:15:04.000
at com.google.api.gax.core.ForwardingRpcFuture.get(ForwardingRpcFuture.java:51)

BodySplash on 7 Feb 2017

@BodySplash I don't think this is immediately related. According to pubsub's error code doc, this seems to be an issue that "happens sometimes". That said, the new client will do the retry for you.

pongad on 7 Feb 2017

The high-performance rewrite was merged and released. I recommend trying version 0.9.3-alpha. @nhoughto and @GEverding can you let us know how it works?

garrettjonesgoogle on 23 Feb 2017

We're integrating today. I'll get back to you

--
Garrett

On Wed, Feb 22, 2017 at 9:03 PM -0500, "garrettjonesgoogle" notifications@github.com wrote:

The high-performance rewrite was merged and released. I recommend trying version 0.9.3-alpha. @nhoughto and @GEverding can you let us know how it works?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

GEverding on 23 Feb 2017

Looking at 0.9.3-alpha, I assume the new high-performance changes are found in the com.google.cloud.pubsub.spi.v1 classes rather than the now deprecated classes? Any examples of how the new classes should be used? Readme / examples all reference the deprecated stuff.

nhoughto on 25 Feb 2017

@nhoughto That is correct. For an example of using Subscriber, could you see if this helps you?

We seem to have missed the README. I have reported this at #1664.

pongad on 26 Feb 2017

Yep that's useful, so the SubscriptionListener exposes the listener failure so you can manually restart it? Rather than handling it for you and auto reconnecting as per the buggy deprecated impl? We shouldn't expect a startAsync() to handle reconnects and should build some auto reconnect with backoff etc logic ourselves?

nhoughto on 27 Feb 2017

Subscriber uses Guava's Service behind the scene. This document should provide an overview.

To answer your question specifically, Subscriber does attempt some retries on its own. Eg, pubsub service replying with error code "UNAVAILABLE" usually signifies a transient condition; Subscriber knows this and will sleep for a bit then try again.

If it encounters a non-retryable error (or sees retryable errors too many times consecutively), it will call the SubscriberListener::failed method. If you want, you can use it to detect and "restart" the Subscriber by creating a new Subscriber object (Services cannot go back to running state after failing). Does this answer your question?

pongad on 27 Feb 2017

Yep seems reasonable, I'll give it a try 👍🏼

nhoughto on 27 Feb 2017

I'm going to go ahead and resolve this now.

garrettjonesgoogle on 11 Mar 2017

Was this page helpful?

0 / 5 - 0 ratings