Google-cloud-go: pubsub: StreamingPull api errors hidden or ignored by the sdk

Created on 14 Jun 2019 · 16Comments · Source: googleapis/google-cloud-go

Client

Pubsub

Describe Your Environment

golang 1.12 on GKE

Expected Behavior

Errors returned by the SDK if the pubsub API is returning errors or the ability to inject a logger or automatic re-establishment of broken connections.

Actual Behavior

Errors seem to be ignored and subscription receivers get into a bad state, we can usually resolve the issue by restarting our pods/instances, which will set up the subscription receiver again as part of the instance bootup.

Ultimately the behaviour we see is basically #1444 - a deadlock.

This graph and table is from https://console.cloud.google.com/apis/api/pubsub.googleapis.com/metrics

streamingpull

So the pub/sub api says that close to 100% of our StreamingPull calls are failing. We use this client exclusively for StreamingPull, and in none of our services can I find any hints about what exactly the problem is. I should also mention that we're way, way under quota for StreamingPull open request, open connections, subscriber throughput, etc according to https://console.cloud.google.com/apis/api/pubsub.googleapis.com/quota

Is there a way I can help further diagnose the issue?

pubsub investigating

Source

dwalker-va

👍1

All 16 comments

Could you please provide:

Your SubscriptionConfig.
Your Receive callback function.
The version of this library you're using (e.g. latest is v0.40.0).
A goroutine dump would be useful, too, and any other diagnostic information you could provide.

jadekler on 14 Jun 2019

SubscriptionConfig: We use the default values
Receive Callback function: We notice the issue across dozens of services using different receive callbacks. I will try to reproduce the issue with a simple receive function, but because it's not something I can reproduce on-demand, it may take a while.
Version: v0.40.0 (although we've experienced this issue for a long time with previous versions too)
Diagnostics (open them to make them readable):
I've captured diagnostics for the same service running at 2 different points in time (during normal operation and during an outage via this issue). There's a big contrast in CPU. This was the only contrast I noticed across all of the stackdriver metrics (CPU, Heap, Contention, Allocated Heap, Threads/Goroutines)

CPU time during normal (good) behaviour
goodcpu

CPU time during bad behaviour (not pulling messages):
badcpu

dwalker-va on 14 Jun 2019

I have some experiments going - one is just a forked version of this repo using a logger to investigate which error codes are being returned by the pubsub API when this state happens. I'm mostly interested in whether it is ResourceExhausted: https://github.com/GoogleCloudPlatform/google-cloud-go/issues/1166#issuecomment-443744705

The other experiment I'm running is using unary pull ("synchronous" in the pulloptions) instead of streaming pull.

I'll post any findings.

dwalker-va on 17 Jun 2019

Thanks! From the other example, I have been able to occasionally replicate the issue with the Receive call hanging indefinitely. Not exactly sure what conditions trigger it, but will also report back once we know more.

hongalex on 18 Jun 2019

👍1

my logging says exactly the same, 99% error logging, my app is super simple, i am not using the latest version, was about to update and check, I assume its the same.

since i am noob if you need something from me as well, i will need a slight guide what do you need me to show you. I can post you my entire code btw, its like 50-60 lines i think.

Izopi4a on 18 Jun 2019

The entire day of pubsub errors (some of these would be triggered by context canceled/pod shutdown on our side, although not many as we only restarted once during the day):
Screen Shot 2019-06-18 at 9 18 58 AM

During a single hour during which we had an outage/receive methods hanging:
Screen Shot 2019-06-18 at 9 18 20 AM

I think the most I can conclude from this is that it doesn't seem to be related to #1166 (at least in the sense that ResourceExhausted is special)

dwalker-va on 18 Jun 2019

@dwalker-va How often are you making the requests? It would be helpful to know what your error rate is.

hongalex on 18 Jun 2019

The metrics page on cloud console allows me to filter by credentials (very nice) so I'm able to get stats for just the one service whose error codes I was tracking.

This is for the past day
Screen Shot 2019-06-18 at 1 56 02 PM

Does that help or were you asking for something else?

dwalker-va on 18 Jun 2019

Yeah that helps! So actually, I think the high error rate you're seeing is intended. StreamingPull ends in an error (usually from the stream being broken). From the streaming pull docs:

StreamingPull has a 100% error rate (this is to be expected)

StreamingPull streams are always terminated with an error code. Note that unlike in regular RPCs, the error here is simply an indication that a stream has been broken, not that requests are failing. Therefore, while the StreamingPull API may have a seemingly surprising 100% error rate, this is by design. To diagnose StreamingPull errors, we recommend focusing on StreamingPull message operation metrics, rather than StreamingPull request metrics.

This actually reflects your metrics well, since your Ack calls are going through properly with no errors.

hongalex on 18 Jun 2019

👍1

@hongalex I'm confused at why this has been closed. Ultimately this isn't a question about the error rate of the StreamingPull API, it is a question about why the Receive method hangs indefinitely - I only mentioned the StreamingPull API error rate because it seemed anomalous, but it's not relevant to the actual problem with the SDK.

dwalker-va on 19 Jun 2019

Sorry about that, I misunderstood your original issue between references to the errors. This does seem fairly similar to #1444 as you said, but I'll leave both issues open for the time being.

Just to be sure, are you always ack'ing or nack'ing all messages are they come in? From your last image, it seems like there are about 10% of messages that have been published but not ack'ed. That potentially leads to the issue of hanging StreamingPull calls over time.

hongalex on 19 Jun 2019

Thanks @hongalex, I'm taking a look at changing the behaviour there. Also about unary pull - still runs into the same issue, so I think what you are suggesting is likely the root cause. I'll confirm asap.

dwalker-va on 19 Jun 2019

This is for the last day, the number are within 0.5% of each other, which should be reasonable to chalk up to leading lag (system is currently processing messages it hasn't acked/nacked yet).
Screen Shot 2019-06-21 at 9 35 39 AM
I also haven't had an outage since making this change. Do you have any intuition about what is happening in the client when people are letting 10% of messages time out? People were letting messages time out that they didnt want to try reprocessing immediately - I guess trying to emulate a task queue backoff. This was still a problem when using the synchronous option.

dwalker-va on 21 Jun 2019

Essentially, the resources held by those messages are never released, specifically from MaxOustandingMessages and MaxOutstandingBytes, part of subscription.ReceiveSettings. They semaphores held by these messages are released only when the message is ack or nackd. If you'd like to implement your own form of flow control, you can set these values to -1 as described in #870. There are other ways to use task queue, but the client library itself was not written to support this kind of behavior and is sort of hacky.

hongalex on 21 Jun 2019

👍1

Thanks @hongalex that explains everything I think and this issue can be closed.
I see that issue produced a PR - interested to see what happens with it. We used this mechanism to delay processing in our old python applications, it appears to be possible via the API but the client library actually punishes using the API in that way - which might not be the wrong thing to do, but it is pretty subtle. I know we love to use Pub/Sub because of the amazing throughput - task queues (Cloud Tasks) for example don't offer nearly the same QPS.

dwalker-va on 21 Jun 2019

Great, glad your issue was mostly resolved. At the moment, we aren't planning to add the functionality to the client library, see here. With that said, we are looking to add delayed retries as a native feature of Pub/Sub, since this feature request has been brought up multiple times.

hongalex on 21 Jun 2019

Was this page helpful?

0 / 5 - 0 ratings