Hi,
Over the weekend I've been experiencing abnormal behaviour with Pub/Sub. Since Friday at 4pm ET I've experienced numerous cases were my services will stop processing messages. I can see in stackdriver that messages are being queued but not processed. Restarting the services allows the messages to be processed until it stops again.
This seems to be a issue with v0.8.x and less with v0.4.0. In v0.8.x there is a infinite timeout (DefaultPubSubRpc.java) and in v0.4.0 there doesn't seem to be (DefaultPubSubRpc.java).
In my experience setting a infinite timeout is rarely a good idea. Is there a reason why you do this?
Possibly seeing the same problem, were doing subscription.pullAsync(..) and sometimes messages just stop being processed and build up until the consumer is restarted. Considering moving to pull() and managing threading / timeouts ourselves.
Garrett
On Thu, Feb 2, 2017 at 6:21 PM -0500, "nhoughto" notifications@github.com wrote:
Possibly seeing the same problem, were doing subscription.pullAsync(..) and sometimes messages just stop being processed and build up until the consumer is restarted. Considering moving to pull() and managing threading / timeouts ourselves.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub, or mute the thread.
Good to know, thanks.
Interesting that the MessageConsumer obj doesn't expose an isAlive() or something similar to detect failure, so the assumption is that it manages it for you. Presume even if gRPC LBs are misbehaving the gRPC client should handle the failure mode itself.
Setting timeout to infinity does indeed sound like a problem. The motivation of making the infinite timeout seems to be https://github.com/GoogleCloudPlatform/google-cloud-java/issues/1157 .
We are currently working on a higher-performance PubSub client, currently in the pubsub-hp branch. In this new implementation, the timeout is properly set.
The difference is that now the RPC will return immediately if there's no message available instead of waiting. It will exponentially back off though (max 10s). You might be making more RPCs with the new client, though I don't think the difference will be big enough to be a problem. Once streaming pull becomes available, the difference should become moot.
@GEverding @nhoughto Do you think this will help your use case?
Yep that sounds good. Happy with immediate return with a backoff (seems like a good tradeoff).
Any idea on timelines for landing pubsub-hp or streaming pull changes? We are going to halt our pubsub roll out to new components until we sort this out.
That seems like a much better idea. Is there a eta for when that branch will be ready?
We expect to release it next week (sometime Feb 7-Feb 10).
I'm still encountering this issue. Bye changing the way I poll (calling pullAsync but with a limit and retrying automatically), I finally got this exception (pasted below)
Don't know if it's connected to the current problem though.
java.util.concurrent.ExecutionException: com.google.cloud.pubsub.PubSubException: io.grpc.StatusRuntimeException: UNAVAILABLE: HTTP/2 error code: NO_ERROR
17:15:04.000
Received Goaway
17:15:04.000
max_age
17:15:04.000
at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:500)
17:15:04.000
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:479)
17:15:04.000
at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:76)
17:15:04.000
at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:62)
17:15:04.000
at com.google.api.gax.core.ForwardingRpcFuture.get(ForwardingRpcFuture.java:51)
@BodySplash I don't think this is immediately related. According to pubsub's error code doc, this seems to be an issue that "happens sometimes". That said, the new client will do the retry for you.
The high-performance rewrite was merged and released. I recommend trying version 0.9.3-alpha. @nhoughto and @GEverding can you let us know how it works?
We're integrating today. I'll get back to you
--
Garrett
On Wed, Feb 22, 2017 at 9:03 PM -0500, "garrettjonesgoogle" notifications@github.com wrote:
The high-performance rewrite was merged and released. I recommend trying version 0.9.3-alpha. @nhoughto and @GEverding can you let us know how it works?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
Looking at 0.9.3-alpha, I assume the new high-performance changes are found in the com.google.cloud.pubsub.spi.v1 classes rather than the now deprecated classes? Any examples of how the new classes should be used? Readme / examples all reference the deprecated stuff.
@nhoughto That is correct. For an example of using Subscriber, could you see if this helps you?
We seem to have missed the README. I have reported this at #1664.
Yep that's useful, so the SubscriptionListener exposes the listener failure so you can manually restart it? Rather than handling it for you and auto reconnecting as per the buggy deprecated impl? We shouldn't expect a startAsync() to handle reconnects and should build some auto reconnect with backoff etc logic ourselves?
Subscriber uses Guava's Service behind the scene. This document should provide an overview.
To answer your question specifically, Subscriber does attempt some retries on its own. Eg, pubsub service replying with error code "UNAVAILABLE" usually signifies a transient condition; Subscriber knows this and will sleep for a bit then try again.
If it encounters a non-retryable error (or sees retryable errors too many times consecutively), it will call the SubscriberListener::failed method. If you want, you can use it to detect and "restart" the Subscriber by creating a new Subscriber object (Services cannot go back to running state after failing). Does this answer your question?
Yep seems reasonable, I'll give it a try 👍🏼
I'm going to go ahead and resolve this now.
Most helpful comment
We expect to release it next week (sometime Feb 7-Feb 10).