Pubsub
golang 1.12 on GKE
Errors returned by the SDK if the pubsub API is returning errors or the ability to inject a logger or automatic re-establishment of broken connections.
Errors seem to be ignored and subscription receivers get into a bad state, we can usually resolve the issue by restarting our pods/instances, which will set up the subscription receiver again as part of the instance bootup.
Ultimately the behaviour we see is basically #1444 - a deadlock.
This graph and table is from https://console.cloud.google.com/apis/api/pubsub.googleapis.com/metrics

So the pub/sub api says that close to 100% of our StreamingPull calls are failing. We use this client exclusively for StreamingPull, and in none of our services can I find any hints about what exactly the problem is. I should also mention that we're way, way under quota for StreamingPull open request, open connections, subscriber throughput, etc according to https://console.cloud.google.com/apis/api/pubsub.googleapis.com/quota
Is there a way I can help further diagnose the issue?
Could you please provide:
v0.40.0).SubscriptionConfig: We use the default values
Receive Callback function: We notice the issue across dozens of services using different receive callbacks. I will try to reproduce the issue with a simple receive function, but because it's not something I can reproduce on-demand, it may take a while.
Version: v0.40.0 (although we've experienced this issue for a long time with previous versions too)
Diagnostics (open them to make them readable):
I've captured diagnostics for the same service running at 2 different points in time (during normal operation and during an outage via this issue). There's a big contrast in CPU. This was the only contrast I noticed across all of the stackdriver metrics (CPU, Heap, Contention, Allocated Heap, Threads/Goroutines)
CPU time during normal (good) behaviour

CPU time during bad behaviour (not pulling messages):

I have some experiments going - one is just a forked version of this repo using a logger to investigate which error codes are being returned by the pubsub API when this state happens. I'm mostly interested in whether it is ResourceExhausted: https://github.com/GoogleCloudPlatform/google-cloud-go/issues/1166#issuecomment-443744705
The other experiment I'm running is using unary pull ("synchronous" in the pulloptions) instead of streaming pull.
I'll post any findings.
Thanks! From the other example, I have been able to occasionally replicate the issue with the Receive call hanging indefinitely. Not exactly sure what conditions trigger it, but will also report back once we know more.
my logging says exactly the same, 99% error logging, my app is super simple, i am not using the latest version, was about to update and check, I assume its the same.
since i am noob if you need something from me as well, i will need a slight guide what do you need me to show you. I can post you my entire code btw, its like 50-60 lines i think.
The entire day of pubsub errors (some of these would be triggered by context canceled/pod shutdown on our side, although not many as we only restarted once during the day):

During a single hour during which we had an outage/receive methods hanging:

I think the most I can conclude from this is that it doesn't seem to be related to #1166 (at least in the sense that ResourceExhausted is special)
@dwalker-va How often are you making the requests? It would be helpful to know what your error rate is.
The metrics page on cloud console allows me to filter by credentials (very nice) so I'm able to get stats for just the one service whose error codes I was tracking.
This is for the past day

Does that help or were you asking for something else?
Yeah that helps! So actually, I think the high error rate you're seeing is intended. StreamingPull ends in an error (usually from the stream being broken). From the streaming pull docs:
StreamingPull has a 100% error rate (this is to be expected)
StreamingPull streams are always terminated with an error code. Note that unlike in regular RPCs, the error here is simply an indication that a stream has been broken, not that requests are failing. Therefore, while the StreamingPull API may have a seemingly surprising 100% error rate, this is by design. To diagnose StreamingPull errors, we recommend focusing on StreamingPull message operation metrics, rather than StreamingPull request metrics.
This actually reflects your metrics well, since your Ack calls are going through properly with no errors.
@hongalex I'm confused at why this has been closed. Ultimately this isn't a question about the error rate of the StreamingPull API, it is a question about why the Receive method hangs indefinitely - I only mentioned the StreamingPull API error rate because it seemed anomalous, but it's not relevant to the actual problem with the SDK.
Sorry about that, I misunderstood your original issue between references to the errors. This does seem fairly similar to #1444 as you said, but I'll leave both issues open for the time being.
Just to be sure, are you always ack'ing or nack'ing all messages are they come in? From your last image, it seems like there are about 10% of messages that have been published but not ack'ed. That potentially leads to the issue of hanging StreamingPull calls over time.
Thanks @hongalex, I'm taking a look at changing the behaviour there. Also about unary pull - still runs into the same issue, so I think what you are suggesting is likely the root cause. I'll confirm asap.
This is for the last day, the number are within 0.5% of each other, which should be reasonable to chalk up to leading lag (system is currently processing messages it hasn't acked/nacked yet).

I also haven't had an outage since making this change. Do you have any intuition about what is happening in the client when people are letting 10% of messages time out? People were letting messages time out that they didnt want to try reprocessing immediately - I guess trying to emulate a task queue backoff. This was still a problem when using the synchronous option.
Essentially, the resources held by those messages are never released, specifically from MaxOustandingMessages and MaxOutstandingBytes, part of subscription.ReceiveSettings. They semaphores held by these messages are released only when the message is ack or nackd. If you'd like to implement your own form of flow control, you can set these values to -1 as described in #870. There are other ways to use task queue, but the client library itself was not written to support this kind of behavior and is sort of hacky.
Thanks @hongalex that explains everything I think and this issue can be closed.
I see that issue produced a PR - interested to see what happens with it. We used this mechanism to delay processing in our old python applications, it appears to be possible via the API but the client library actually punishes using the API in that way - which might not be the wrong thing to do, but it is pretty subtle. I know we love to use Pub/Sub because of the amazing throughput - task queues (Cloud Tasks) for example don't offer nearly the same QPS.
Great, glad your issue was mostly resolved. At the moment, we aren't planning to add the functionality to the client library, see here. With that said, we are looking to add delayed retries as a native feature of Pub/Sub, since this feature request has been brought up multiple times.