PubSub
Alpine:3.7 on GCE
Receive retries and eventually returns an error
Receive hangs indefinitely, never returning an error or continuing to process messages. Because of this the application that is using pubsub client with Receive is completely locked and I must restart the docker container.
I have been noticing that messages stop processing periodically using pubsub client in my application. I am handiling Receive errors properly, however an error is never returned. I then added logging into my app with the Receive callback. As soon as I notice messages are no longer processing, nothing is logged within the callback and Receive is completely locked up. I am not sure how to handle this.
I added logging within the pupsub isRetryable(err error) bool function. and each time I experience this it seems there is a corresponding Unavailable log:
Unavailable The service was unable to fulfill your request. Please try again. [code=8a75]
Oddly, it seems that isRetryable should return true and retry in this case, so I am unsure where the lock is happening. An immediate restart of the Docker container resolves the issue, but only for a while.
This seems similar to issue:
https://github.com/GoogleCloudPlatform/google-cloud-go/issues/1156
However, in that issue the reported error is the cause of a Context error, which I am not seeing and therefore I think they may be separate bugs.
Here is an example of a simple subscriber package I created that uses pubsub client:
https://gist.github.com/erickertz/2090853580dcead48682886b2cb7f0d4
Relevant: https://github.com/GoogleCloudPlatform/google-cloud-go/issues/740
Correct me if I'm wrong, but I understand this as you're experiencing a Service Unavailable error [1], which causes retry to occur, but you never stop seeing this error and subsequently retry indefinitely. Is that the case, or is there some other deadlock you're observing?
1:
Unavailable The service was unable to fulfill your request. Please try again. [code=8a75]
@jadekler Thanks for the quick reply. I do stop seeing this error (Unavailable Code = 14) as it appears only once. But as soon as that error occurs, nothing in my Receive callback loop gets executed and at the same time no errors are returned by Receive, therefore it is completely locked. At this point I'm unsure where exactly the lock is happening. I just added even more logging to try and track it down.
I haven't found a way to intentionally reproduce yet, however I am experiencing it about once a day now. It has gone for a few days in the past without issue.
I have put logging in each isRetryable error code case. Interestingly, I do see errors getting caught in the other : codes.DeadlineExceeded, codes.Internal, codes.ResourceExhausted and they do recover. It seems as though only the codes.Unavailable case locks up. Also, the codes.Unavailable returns true as it is not a "Server shutdownNow invoked".
Here is my logging in isRetryable:
https://gist.github.com/erickertz/17b0f580e154fcb5df166b831bc24aba
I added more logging in openWithRetry and call functions to see if this helps clarify anything. Thanks again!
Thanks for this detailed report. I'll investigate shortly and try to repro.
Can you add logging in the two places after isRetryable returns?
@jba These are the two places I found where openWithRetry is used and I loaded them up pretty good since the last time it got hung up:
https://gist.github.com/erickertz/07d6abfec0430d83dfb44049f89703be
https://gist.github.com/erickertz/914c0ab1507c973b519f699b9d0b15de
Also, for what it's worth, I'm running in GCE not CKE. I have also tried using both service account and json credentials. Before I got too far into this I read of another issue that sounded similar and it was suggested to switch creds type, so I tried it. Did't seems to make a difference. Thanks again, I'll post a follow up comment with the new logs as soon as I experience this happening again.
@erickertz I tried and failed to replicate your case. The linked test sets up the server to respond with "Unavailable", msg, msg, msg, then expects the client to be able to handle the first unavailable and then process 3 messages - the client seems to handle it just fine.
Just to double check - you're certain that there's only a single UNAVAILABLE that you receive? When you receive one of these messages, the client enters multiplicative backoff (with jitter). If you were receiving multiple it may be that the client is just waiting a large number of seconds before retrying (the max is 30s).
Also, if you have any kind of reproduction we could try, that would be hugely appreciated. :)
@jadekler I just rechecked my logs and I am certain that "Unavailable" is only showing up once, right at the time where Receive stopped logging and I stopped ingesting messages. I'm still uncertain that this is the issue however as it has only happened twice since I added all the logging.
As far as timing goes, I don't think thats the issue. I've seen it locked up for a whole weekend once while I was away. At the same time, I've done immediate restarts and it starts ingesting again just fine.
I appreciate your responses. Do you mind if I suggest we keep this open until I am able to provide even further details from my additional logging?
Ah ok, that does seem very fishy. Yes, absolutely, let's keep this open until we figure out the problem. I'll hang ten until we hear back from you with more details.
FYI, I have been observing similar symptoms over the last week or so too. f623f34 looks suspicious (it makes some things non-retryable), though 539de1d might also be related.
Thanks for the note. Two is enough for me to definitely think there's an issue. I'll try to repro by making a long-running application.
@dsymonds, could you try running using the commit prior to the two you mentioned?
I rewound to c0badda (so a bit earlier, but I was most interested in getting our production system back into a stable state) and it's been fine for about the last 24h.
I think I may have found the source of the problem. See https://code-review.googlesource.com/c/gocloud/+/33711 for a fix. Please try with this updated code and let me know if y'all still see issues. Sincere apologies for the inconvenience, it was my fault.
@jadekler @jba I don't want to speak too soon, but I think I got to the bottom of my issue. I think it was two part.
I believe what I am experiencing is exactly the issue reported in #1156 where cancel context is causing a lockup.
I overlooked my implementation of context in my app and it was being misused. My app subscribes to two separate subscriptions and if certain conditions were met (in this case, consecutive failures to write to a data source) I wanted all the subscriptions to die. Therefore, I shared a cancellable context between the two. My tests showed that this should work, however it has never actually happened so I didn't see it perform.
The issue is that I also mistakenly used the same context when initializing the client: pubsub.NewClient() . I didn't follow this down to see where or what exactly happens to context in creating the client, but its obvious that this is wrong and could cause issues.
So I think the context was being cancelled out without me knowing, then I was experiencing the same lockup as mentioned in #1156. I put the fix in from #1156 also fixed context usage in my own code this morning. I haven't seen the issue occur since, fingers crossed.
Also, Im not sure why exactly the codes.Unavailable was being logged at the same exact time as lockup here. The additional logging in the retry always retuned nil error, as expected after retrying.
I'm not 100% sure if my this is fixed but its been about 8 hours now and looking good. Thanks for everyones help!
@erickertz Great to hear. I'll leave this issue open over the weekend and close early next week if y'all are not seeing anything suspicious. Note that you should get the latest code as of now, since we just recently fixed a deadlock as well as removed Canceled from the retry logic as described in https://github.com/GoogleCloudPlatform/google-cloud-go/issues/1156.
Unfortunately I spoke too soon :( multiple lockups so far this weekend. I'm going to move to the latest master hopefully today and let it run for a while. Will report back. Thanks.
Yes, I also observed the problem in my system, using HEAD (built at 2018-09-29 21:40:41 +0000 UTC).
@jacobsa @dsymonds Unfortunately, can still confirm multiple lockups with the latest master. I cannot provide any further logs as this is so new. However, am willing to add logs where necessary if anyone can point me in the right direction.
@dsymonds was this stable for you at any release or revision for you?
I don't have super-detailed records, but it was stable before 21 September, and then started getting unreliable after then. However, as mentioned above, rewinding to c0badda (1 Sep) didn't seem to sort it out properly either, but that was only pinning the cloud.google.com/go repository. It's possible that some dependency (e.g. gRPC) has an unexpected interaction.
looks like the latest actually did return an error that i logged:
"rpc error: code = Unavailable desc = transport is closing"
UPDATE: actually, each of the two times since the latest code when my app stopped processing messages it DID return the error above. So thats certainly a good thing.
Should it be retrying with "Unavailable desc = transport is closing" error though?
Thanks, we recently discovered this independently as well. Some of the RPCs are not retrying on Unavailable.
For now, please retry Receive if you see this error. We're working on a fix that should make that unnecessary.
@jba thanks! so far no lockups since the new code updates :)
It's more than just Unavailable. This morning, running a build from master yesterday, I got a situation where no messages were processed at all for about three hours. Only after that length of time did the Unavailable code get returned and my retry workaround kick in.
@dsymonds thats unfortunate. I haven't experienced lockup yet since the fixes. I did put another ERBS style retry in for for all errors that are returned though using the package:
https://github.com/cenkalti/backoff
Here is a sample of my subscription wrapper package with retries:
https://gist.github.com/erickertz/fe0ec9e66ea957572ba036ba98211f30
@dsymonds Would you mind adding some logging to cloud.google.com/go/pubsub so we can see what's going on? I'd like to understand first whether some RPC is just blocking, or we're in a retry loop. Let me know if you want me to suggest places to log.
Our build is such that it would be easiest if you could make a named branch and add any logging in a separate commit. Then we can merely check out that branch during our build. Does that work for you?
We're running 8091b50 on GCE and we're potentially noticing the same issue. The following graphs are from the opencensus metrics for a single queue on a per minute basis summed across 12 servers.
We were processing about 360k jobs a minute until 23:50 UTC when it plummeted:

Around the same time we can see retries:

Request stream count went down as well (as well as responses, naturally):

Here's a graph of the open count:

Unfortunately, I was unable to grab a goroutine dump during the time period that it was low and as you can see, it jumped back up after an hour and with a little over 18 million undelivered it's catching up.
Let me know if there's more information I can provide. I'll try and get a goroutine dump when it happens again, if that will help.
We also had a service running ce3259d that had the same issue at the same time and take this last one with a grain of salt but another service running f8d6af3 didn't seem to have the issue (but this service only processes jobs in batches every hour so could also be the workload difference that didn't cause it).
It happened again and here is a goroutine dump with the relevant goroutines:
605 @ 0x42ca3a 0x427d5a 0x4273d7 0x49a29b 0x49a31d 0x49b17d 0x514ccf 0x525d7a 0x623866 0x623d70 0x627310 0x545dc8 0x463706 0x463878 0x93870b 0x938f94 0x955762 0x459d51
# 0x4273d6 internal/poll.runtime_pollWait+0x56 /usr/local/go/src/runtime/netpoll.go:173
# 0x49a29a internal/poll.(*pollDesc).wait+0x9a /usr/local/go/src/internal/poll/fd_poll_runtime.go:85
# 0x49a31c internal/poll.(*pollDesc).waitRead+0x3c /usr/local/go/src/internal/poll/fd_poll_runtime.go:90
# 0x49b17c internal/poll.(*FD).Read+0x17c /usr/local/go/src/internal/poll/fd_unix.go:157
# 0x514cce net.(*netFD).Read+0x4e /usr/local/go/src/net/fd_unix.go:202
# 0x525d79 net.(*conn).Read+0x69 /usr/local/go/src/net/net.go:176
# 0x623865 crypto/tls.(*block).readFromUntil+0x95 /usr/local/go/src/crypto/tls/conn.go:493
# 0x623d6f crypto/tls.(*Conn).readRecord+0xdf /usr/local/go/src/crypto/tls/conn.go:595
# 0x62730f crypto/tls.(*Conn).Read+0xff /usr/local/go/src/crypto/tls/conn.go:1156
# 0x545dc7 bufio.(*Reader).Read+0x237 /usr/local/go/src/bufio/bufio.go:216
# 0x463705 io.ReadAtLeast+0x85 /usr/local/go/src/io/io.go:309
# 0x463877 io.ReadFull+0x57 /usr/local/go/src/io/io.go:327
# 0x93870a golang.org/x/net/http2.readFrameHeader+0x7a /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/golang.org/x/net/http2/frame.go:237
# 0x938f93 golang.org/x/net/http2.(*Framer).ReadFrame+0xa3 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/golang.org/x/net/http2/frame.go:492
# 0x955761 google.golang.org/grpc/internal/transport.(*http2Client).reader+0x171 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/internal/transport/http2_client.go:1205
605 @ 0x42ca3a 0x43c5b0 0x9487e3 0x948f18 0x95c060 0x459d51
# 0x9487e2 google.golang.org/grpc/internal/transport.(*controlBuffer).get+0x132 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/internal/transport/controlbuf.go:317
# 0x948f17 google.golang.org/grpc/internal/transport.(*loopyWriter).run+0x1a7 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/internal/transport/controlbuf.go:435
# 0x95c05f google.golang.org/grpc/internal/transport.newHTTP2Client.func3+0x7f /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/internal/transport/http2_client.go:319
41 @ 0x42ca3a 0x42caee 0x404242 0x403efb 0xa23ff8 0x459d51
# 0xa23ff7 cloud.google.com/go/pubsub.(*Subscription).receive.func1+0x47 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/subscription.go:512
35 @ 0x42ca3a 0x43c5b0 0x9712d7 0x96c500 0x977f14 0x977282 0x96850c 0x9682c3 0xa07a92 0xa16248 0x980261 0x980012 0xa128d0 0xa23447 0xa1c16e 0xa1bea0 0xa1b8b2 0x459d51
# 0x9712d6 google.golang.org/grpc.(*pickerWrapper).pick+0x566 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/picker_wrapper.go:124
# 0x96c4ff google.golang.org/grpc.(*ClientConn).getTransport+0x6f /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/clientconn.go:721
# 0x977f13 google.golang.org/grpc.(*clientStream).newAttemptLocked+0x153 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/stream.go:309
# 0x977281 google.golang.org/grpc.newClientStream+0x7c1 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/stream.go:269
# 0x96850b google.golang.org/grpc.invoke+0x9b /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/call.go:66
# 0x9682c2 google.golang.org/grpc.(*ClientConn).Invoke+0x1b2 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/call.go:37
# 0xa07a91 google.golang.org/genproto/googleapis/pubsub/v1.(*subscriberClient).Acknowledge+0xd1 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/genproto/googleapis/pubsub/v1/pubsub.pb.go:3046
# 0xa16247 cloud.google.com/go/pubsub/apiv1.(*SubscriberClient).Acknowledge.func1+0x77 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/apiv1/subscriber_client.go:321
# 0x980260 github.com/googleapis/gax-go.invoke+0x90 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/github.com/googleapis/gax-go/invoke.go:70
# 0x980011 github.com/googleapis/gax-go.Invoke+0xe1 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/github.com/googleapis/gax-go/invoke.go:48
# 0xa128cf cloud.google.com/go/pubsub/apiv1.(*SubscriberClient).Acknowledge+0x1af /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/apiv1/subscriber_client.go:319
# 0xa23446 cloud.google.com/go/pubsub.(*messageIterator).sendAck.func1+0x106 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/iterator.go:384
# 0xa1c16d cloud.google.com/go/pubsub.(*messageIterator).sendAckIDRPC+0x22d /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/iterator.go:448
# 0xa1be9f cloud.google.com/go/pubsub.(*messageIterator).sendAck+0x4f /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/iterator.go:379
# 0xa1b8b1 cloud.google.com/go/pubsub.(*messageIterator).sender+0x491 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/iterator.go:336
30 @ 0x42ca3a 0x43c5b0 0x95abc0 0x95a957 0x95b775 0x463706 0x463878 0x95b6bf 0x973975 0x97461d 0x97a205 0x97ce5d 0x978f9e 0x9799a0 0xa07e92 0xa23c91 0xa1d74c 0xa1d9f9 0xa1b3a3 0xa1b0f2 0xa211d6 0xa23f87 0xa17837 0x459d51
# 0x95abbf google.golang.org/grpc/internal/transport.(*recvBufferReader).read+0x1df /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/internal/transport/transport.go:142
# 0x95a956 google.golang.org/grpc/internal/transport.(*recvBufferReader).Read+0x66 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/internal/transport/transport.go:131
# 0x95b774 google.golang.org/grpc/internal/transport.(*transportReader).Read+0x54 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/internal/transport/transport.go:401
# 0x463705 io.ReadAtLeast+0x85 /usr/local/go/src/io/io.go:309
# 0x463877 io.ReadFull+0x57 /usr/local/go/src/io/io.go:327
# 0x95b6be google.golang.org/grpc/internal/transport.(*Stream).Read+0xbe /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/internal/transport/transport.go:385
# 0x973974 google.golang.org/grpc.(*parser).recvMsg+0x64 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/rpc_util.go:476
# 0x97461c google.golang.org/grpc.recv+0x4c /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/rpc_util.go:605
# 0x97a204 google.golang.org/grpc.(*csAttempt).recvMsg+0xe4 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/stream.go:762
# 0x97ce5c google.golang.org/grpc.(*clientStream).RecvMsg.func1+0x3c /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/stream.go:648
# 0x978f9d google.golang.org/grpc.(*clientStream).withRetry+0x27d /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/stream.go:530
# 0x97999f google.golang.org/grpc.(*clientStream).RecvMsg+0x7f /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/stream.go:647
# 0xa07e91 google.golang.org/genproto/googleapis/pubsub/v1.(*subscriberStreamingPullClient).Recv+0x61 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/genproto/googleapis/pubsub/v1/pubsub.pb.go:3087
# 0xa23c90 cloud.google.com/go/pubsub.(*pullStream).Recv.func1+0x70 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/pullstream.go:172
# 0xa1d74b cloud.google.com/go/pubsub.(*pullStream).call+0xcb /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/pullstream.go:132
# 0xa1d9f8 cloud.google.com/go/pubsub.(*pullStream).Recv+0x58 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/pullstream.go:169
# 0xa1b3a2 cloud.google.com/go/pubsub.(*messageIterator).recvMessages+0x32 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/iterator.go:251
# 0xa1b0f1 cloud.google.com/go/pubsub.(*messageIterator).receive+0x601 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/iterator.go:194
# 0xa211d5 cloud.google.com/go/pubsub.(*Subscription).receive+0x3f5 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/subscription.go:545
# 0xa23f86 cloud.google.com/go/pubsub.(*Subscription).Receive.func2+0x56 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/subscription.go:489
# 0xa17836 golang.org/x/sync/errgroup.(*Group).Go.func1+0x56 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/golang.org/x/sync/errgroup/errgroup.go:58
30 @ 0x42ca3a 0x43c5b0 0x97c94e 0x459d51
# 0x97c94d google.golang.org/grpc.newClientStream.func5+0xfd /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/stream.go:287
11 @ 0x42ca3a 0x43c5b0 0x9712d7 0x96c500 0x977f14 0x977282 0x9768ff 0xa07cea 0xa163de 0x980261 0x980012 0xa12e44 0xa249a3 0xa23969 0xa1d548 0xa1d337 0xa1d6e8 0xa1d9f9 0xa1b3a3 0xa1b0f2 0xa211d6 0xa23f87 0xa17837 0x459d51
# 0x9712d6 google.golang.org/grpc.(*pickerWrapper).pick+0x566 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/picker_wrapper.go:124
# 0x96c4ff google.golang.org/grpc.(*ClientConn).getTransport+0x6f /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/clientconn.go:721
# 0x977f13 google.golang.org/grpc.(*clientStream).newAttemptLocked+0x153 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/stream.go:309
# 0x977281 google.golang.org/grpc.newClientStream+0x7c1 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/stream.go:269
# 0x9768fe google.golang.org/grpc.(*ClientConn).NewStream+0x17e /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/stream.go:145
# 0xa07ce9 google.golang.org/genproto/googleapis/pubsub/v1.(*subscriberClient).StreamingPull+0xa9 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/genproto/googleapis/pubsub/v1/pubsub.pb.go:3063
# 0xa163dd cloud.google.com/go/pubsub/apiv1.(*SubscriberClient).StreamingPull.func1+0x7d /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/apiv1/subscriber_client.go:359
# 0x980260 github.com/googleapis/gax-go.invoke+0x90 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/github.com/googleapis/gax-go/invoke.go:70
# 0x980011 github.com/googleapis/gax-go.Invoke+0xe1 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/github.com/googleapis/gax-go/invoke.go:48
# 0xa12e43 cloud.google.com/go/pubsub/apiv1.(*SubscriberClient).StreamingPull+0x1d3 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/apiv1/subscriber_client.go:357
# 0xa249a2 cloud.google.com/go/pubsub/apiv1.(*SubscriberClient).StreamingPull-fm+0x62 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/iterator.go:77
# 0xa23968 cloud.google.com/go/pubsub.newPullStream.func1+0x158 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/pullstream.go:47
# 0xa1d547 cloud.google.com/go/pubsub.(*pullStream).openWithRetry+0x77 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/pullstream.go:108
# 0xa1d336 cloud.google.com/go/pubsub.(*pullStream).get+0x106 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/pullstream.go:100
# 0xa1d6e7 cloud.google.com/go/pubsub.(*pullStream).call+0x67 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/pullstream.go:127
# 0xa1d9f8 cloud.google.com/go/pubsub.(*pullStream).Recv+0x58 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/pullstream.go:169
# 0xa1b3a2 cloud.google.com/go/pubsub.(*messageIterator).recvMessages+0x32 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/iterator.go:251
# 0xa1b0f1 cloud.google.com/go/pubsub.(*messageIterator).receive+0x601 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/iterator.go:194
# 0xa211d5 cloud.google.com/go/pubsub.(*Subscription).receive+0x3f5 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/subscription.go:545
# 0xa23f86 cloud.google.com/go/pubsub.(*Subscription).Receive.func2+0x56 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/subscription.go:489
# 0xa17836 golang.org/x/sync/errgroup.(*Group).Go.func1+0x56 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/golang.org/x/sync/errgroup/errgroup.go:58
6 @ 0x42ca3a 0x43c5b0 0xa1b6e9 0x459d51
# 0xa1b6e8 cloud.google.com/go/pubsub.(*messageIterator).sender+0x2c8 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/iterator.go:279
5 @ 0x42ca3a 0x43c5b0 0x955f47 0x459d51
# 0x955f46 google.golang.org/grpc/internal/transport.(*http2Client).keepalive+0x136 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/internal/transport/http2_client.go:1256
2 @ 0x42ca3a 0x42caee 0x404242 0x403efb 0x9a18ba 0x962867 0x97b2eb 0x459d51
# 0x9a18b9 google.golang.org/api/internal.(*PoolResolver).Next+0x49 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/api/internal/pool.go:54
# 0x962866 google.golang.org/grpc.(*roundRobin).watchAddrUpdates+0x56 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/balancer.go:145
# 0x97b2ea google.golang.org/grpc.(*roundRobin).Start.func1+0x2a /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/balancer.go:220
2 @ 0x42ca3a 0x42caee 0x404242 0x403f3b 0x965dac 0x459d51
# 0x965dab google.golang.org/grpc.(*balancerWrapper).lbWatcher+0xeb /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/balancer_v1_wrapper.go:124
1 @ 0x42ca3a 0x43c5b0 0x9712d7 0x96c500 0x977f14 0x977282 0x96850c 0x9682c3 0xa06e12 0xa149c4 0x980261 0x980012 0xa14b60 0xa14cca 0x913f79 0x913e45 0x91414a 0xa0fe1f 0xa241ca 0xa21daf 0xa39f5a 0xa3d8e4 0x459d51
# 0x9712d6 google.golang.org/grpc.(*pickerWrapper).pick+0x566 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/picker_wrapper.go:124
# 0x96c4ff google.golang.org/grpc.(*ClientConn).getTransport+0x6f /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/clientconn.go:721
# 0x977f13 google.golang.org/grpc.(*clientStream).newAttemptLocked+0x153 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/stream.go:309
# 0x977281 google.golang.org/grpc.newClientStream+0x7c1 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/stream.go:269
# 0x96850b google.golang.org/grpc.invoke+0x9b /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/call.go:66
# 0x9682c2 google.golang.org/grpc.(*ClientConn).Invoke+0x1b2 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/call.go:37
# 0xa06e11 google.golang.org/genproto/googleapis/pubsub/v1.(*publisherClient).ListTopics+0xd1 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/genproto/googleapis/pubsub/v1/pubsub.pb.go:2616
# 0xa149c3 cloud.google.com/go/pubsub/apiv1.(*PublisherClient).ListTopics.func1.1+0x83 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/apiv1/publisher_client.go:237
# 0x980260 github.com/googleapis/gax-go.invoke+0x90 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/github.com/googleapis/gax-go/invoke.go:70
# 0x980011 github.com/googleapis/gax-go.Invoke+0xe1 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/github.com/googleapis/gax-go/invoke.go:48
# 0xa14b5f cloud.google.com/go/pubsub/apiv1.(*PublisherClient).ListTopics.func1+0x13f /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/apiv1/publisher_client.go:235
# 0xa14cc9 cloud.google.com/go/pubsub/apiv1.(*PublisherClient).ListTopics.func2+0x69 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/apiv1/publisher_client.go:246
# 0x913f78 google.golang.org/api/iterator.(*PageInfo).fill+0x48 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/api/iterator/iterator.go:143
# 0x913e44 google.golang.org/api/iterator.(*PageInfo).next+0xc4 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/api/iterator/iterator.go:122
# 0x914149 google.golang.org/api/iterator.(*PageInfo).(google.golang.org/api/iterator.next)-fm+0x29 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/google.golang.org/api/iterator/iterator.go:100
# 0xa0fe1e cloud.google.com/go/pubsub/apiv1.(*TopicIterator).Next+0x2e /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/apiv1/publisher_client.go:377
# 0xa241c9 cloud.google.com/go/pubsub.(*Client).Topics.func1+0x29 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/topic.go:230
# 0xa21dae cloud.google.com/go/pubsub.(*TopicIterator).Next+0x2e /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/topic.go:247
# 0xa39f59 bitbucket.org/levenlabs/llib/ldb.(*PubSubConsumer).refreshTopics+0x89 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/bitbucket.org/levenlabs/llib/ldb/pubsub.go:580
# 0xa3d8e3 bitbucket.org/levenlabs/llib/ldb.(*PubSubConsumer).Consume.func2+0x263 /opt/TeamCityBuild/work/wombat-2/c23cc9760fd462e1/.deps/src/bitbucket.org/levenlabs/llib/ldb/pubsub.go:649
I kept 7 of the 12 GCE instances on the latest versions of grpc and this client and downgraded 5 of the 12 to 8dea3dc (1.15.0) and c728a00 (0.27.0), respectively, and all 12 had an issue at the same time. Retries spiked and the following that every server is only receiving 10% of the submissions it should be and the subscription is backing up. Could it be a server-side issue?
Thanks so much for those dumps, that is very useful. We suspect that it may be a gRPC issue, noting that the picker seems stuck in each of the RPC calls. It may be caused by https://github.com/grpc/grpc-go/issues/2341 or https://github.com/grpc/grpc-go/issues/2340: I'm working on a fix for those now and will follow up as soon as it's out.
Hi,
I'm facing similar issue when upgrading to latest version (v0.25.0) from v0.21.0
to solve the deadlocks, I need to restart my VM, here the details:
client: pubsub
environment: VM on GCE
sdk version: v0.25.0
log:
{
insertId: "w72t01f5k32vc"
jsonPayload: {
container: {鈥
data: "{"caller":"service.go:158","lvl":"eror","msg":"rpc error: code = Unavailable desc = there is no connection available","pkg":"service","t":"2018-10-03T20:57:04.737780388+07:00"}"
instance: {鈥
}
logName: "projects/*****/logs/gcplogs-docker-driver"
receiveTimestamp: "2018-10-03T13:57:05.787444130Z"
resource: {鈥
timestamp: "2018-10-03T13:57:04.738081382Z"
}
expected behaviour: retry to connect when error occurs and continue subscribe the data
unacknowledge messages report:

subscribtion code:
go func() {
err := s.gcpPubSub.SubscriptionGET.Receive(s.ctx, func(ctx context.Context, msg *pubsub.Message) {
// processing message
log.Debug(string(msg.Data))
msg.Ack()
})
if err != nil {
log.Error(err.Error())
}
}()
@yogihardi, the latest version is v0.29.0.
@jadekler are there other fixes than the one in https://github.com/grpc/grpc-go/pull/2346? We deployed that change to a few servers and we're noticing that those servers are having the drop-off in jobs at the same time as the other servers.
Here's a goroutine dump:
604 @ 0x42ca3a 0x427d5a 0x4273d7 0x49a29b 0x49a31d 0x49b17d 0x514ccf 0x525d7a 0x623866 0x623d70 0x627310 0x545dc8 0x463706 0x463878 0x9387ab 0x939034 0x955802 0x459d51
# 0x4273d6 internal/poll.runtime_pollWait+0x56 /usr/local/go/src/runtime/netpoll.go:173
# 0x49a29a internal/poll.(*pollDesc).wait+0x9a /usr/local/go/src/internal/poll/fd_poll_runtime.go:85
# 0x49a31c internal/poll.(*pollDesc).waitRead+0x3c /usr/local/go/src/internal/poll/fd_poll_runtime.go:90
# 0x49b17c internal/poll.(*FD).Read+0x17c /usr/local/go/src/internal/poll/fd_unix.go:157
# 0x514cce net.(*netFD).Read+0x4e /usr/local/go/src/net/fd_unix.go:202
# 0x525d79 net.(*conn).Read+0x69 /usr/local/go/src/net/net.go:176
# 0x623865 crypto/tls.(*block).readFromUntil+0x95 /usr/local/go/src/crypto/tls/conn.go:493
# 0x623d6f crypto/tls.(*Conn).readRecord+0xdf /usr/local/go/src/crypto/tls/conn.go:595
# 0x62730f crypto/tls.(*Conn).Read+0xff /usr/local/go/src/crypto/tls/conn.go:1156
# 0x545dc7 bufio.(*Reader).Read+0x237 /usr/local/go/src/bufio/bufio.go:216
# 0x463705 io.ReadAtLeast+0x85 /usr/local/go/src/io/io.go:309
# 0x463877 io.ReadFull+0x57 /usr/local/go/src/io/io.go:327
# 0x9387aa golang.org/x/net/http2.readFrameHeader+0x7a /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/golang.org/x/net/http2/frame.go:237
# 0x939033 golang.org/x/net/http2.(*Framer).ReadFrame+0xa3 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/golang.org/x/net/http2/frame.go:492
# 0x955801 google.golang.org/grpc/internal/transport.(*http2Client).reader+0x171 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/internal/transport/http2_client.go:1205
604 @ 0x42ca3a 0x43c5b0 0x948883 0x948fb8 0x95c100 0x459d51
# 0x948882 google.golang.org/grpc/internal/transport.(*controlBuffer).get+0x132 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/internal/transport/controlbuf.go:317
# 0x948fb7 google.golang.org/grpc/internal/transport.(*loopyWriter).run+0x1a7 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/internal/transport/controlbuf.go:435
# 0x95c0ff google.golang.org/grpc/internal/transport.newHTTP2Client.func3+0x7f /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/internal/transport/http2_client.go:319
41 @ 0x42ca3a 0x42caee 0x404242 0x403efb 0xa240f8 0x459d51
# 0xa240f7 cloud.google.com/go/pubsub.(*Subscription).receive.func1+0x47 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/subscription.go:512
41 @ 0x42ca3a 0x43c5b0 0x97ca4e 0x459d51
# 0x97ca4d google.golang.org/grpc.newClientStream.func5+0xfd /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/stream.go:287
41 @ 0x42ca3a 0x43c5b0 0xa1b7e9 0x459d51
# 0xa1b7e8 cloud.google.com/go/pubsub.(*messageIterator).sender+0x2c8 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/iterator.go:279
40 @ 0x42ca3a 0x43c5b0 0x95ac60 0x95a9f7 0x95b815 0x463706 0x463878 0x95b75f 0x9739d5 0x97467d 0x97a265 0x97cf5d 0x978ffe 0x979a00 0xa07f92 0xa23d91 0xa1d84c 0xa1daf9 0xa1b4a3 0xa1b1f2 0xa212d6 0xa24087 0xa17937 0x459d51
# 0x95ac5f google.golang.org/grpc/internal/transport.(*recvBufferReader).read+0x1df /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/internal/transport/transport.go:142
# 0x95a9f6 google.golang.org/grpc/internal/transport.(*recvBufferReader).Read+0x66 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/internal/transport/transport.go:131
# 0x95b814 google.golang.org/grpc/internal/transport.(*transportReader).Read+0x54 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/internal/transport/transport.go:401
# 0x463705 io.ReadAtLeast+0x85 /usr/local/go/src/io/io.go:309
# 0x463877 io.ReadFull+0x57 /usr/local/go/src/io/io.go:327
# 0x95b75e google.golang.org/grpc/internal/transport.(*Stream).Read+0xbe /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/internal/transport/transport.go:385
# 0x9739d4 google.golang.org/grpc.(*parser).recvMsg+0x64 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/rpc_util.go:476
# 0x97467c google.golang.org/grpc.recv+0x4c /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/rpc_util.go:605
# 0x97a264 google.golang.org/grpc.(*csAttempt).recvMsg+0xe4 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/stream.go:762
# 0x97cf5c google.golang.org/grpc.(*clientStream).RecvMsg.func1+0x3c /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/stream.go:648
# 0x978ffd google.golang.org/grpc.(*clientStream).withRetry+0x27d /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/stream.go:530
# 0x9799ff google.golang.org/grpc.(*clientStream).RecvMsg+0x7f /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/stream.go:647
# 0xa07f91 google.golang.org/genproto/googleapis/pubsub/v1.(*subscriberStreamingPullClient).Recv+0x61 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/genproto/googleapis/pubsub/v1/pubsub.pb.go:3086
# 0xa23d90 cloud.google.com/go/pubsub.(*pullStream).Recv.func1+0x70 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/pullstream.go:172
# 0xa1d84b cloud.google.com/go/pubsub.(*pullStream).call+0xcb /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/pullstream.go:132
# 0xa1daf8 cloud.google.com/go/pubsub.(*pullStream).Recv+0x58 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/pullstream.go:169
# 0xa1b4a2 cloud.google.com/go/pubsub.(*messageIterator).recvMessages+0x32 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/iterator.go:251
# 0xa1b1f1 cloud.google.com/go/pubsub.(*messageIterator).receive+0x601 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/iterator.go:194
# 0xa212d5 cloud.google.com/go/pubsub.(*Subscription).receive+0x3f5 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/subscription.go:545
# 0xa24086 cloud.google.com/go/pubsub.(*Subscription).Receive.func2+0x56 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/subscription.go:489
# 0xa17936 golang.org/x/sync/errgroup.(*Group).Go.func1+0x56 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/golang.org/x/sync/errgroup/errgroup.go:58
4 @ 0x42ca3a 0x43c5b0 0x955fe7 0x459d51
# 0x955fe6 google.golang.org/grpc/internal/transport.(*http2Client).keepalive+0x136 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/internal/transport/http2_client.go:1256
2 @ 0x42ca3a 0x42caee 0x404242 0x403efb 0x9a19ba 0x962907 0x97b34b 0x459d51
# 0x9a19b9 google.golang.org/api/internal.(*PoolResolver).Next+0x49 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/api/internal/pool.go:54
# 0x962906 google.golang.org/grpc.(*roundRobin).watchAddrUpdates+0x56 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/balancer.go:145
# 0x97b34a google.golang.org/grpc.(*roundRobin).Start.func1+0x2a /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/balancer.go:220
2 @ 0x42ca3a 0x42caee 0x404242 0x403f3b 0x965e4c 0x459d51
# 0x965e4b google.golang.org/grpc.(*balancerWrapper).lbWatcher+0xeb /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/balancer_v1_wrapper.go:124
2 @ 0x42ca3a 0x43c5b0 0x964a6a 0x459d51
# 0x964a69 google.golang.org/grpc.(*ccBalancerWrapper).watcher+0x149 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/balancer_conn_wrappers.go:122
2 @ 0x42ca3a 0x43c5b0 0x972f72 0x459d51
# 0x972f71 google.golang.org/grpc.(*ccResolverWrapper).watcher+0x181 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/resolver_conn_wrapper.go:111
1 @ 0x42ca3a 0x43c5b0 0x7911bd 0x459d51
# 0x7911bc go.opencensus.io/stats/view.(*worker).start+0x11c /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/go.opencensus.io/stats/view/worker.go:150
1 @ 0x42ca3a 0x43c5b0 0x95af3f 0x95b02b 0x97a74e 0x97cf5d 0x978dfe 0x979a00 0xa07f92 0xa23d91 0xa1d84c 0xa1daf9 0xa1b4a3 0xa1b1f2 0xa212d6 0xa24087 0xa17937 0x459d51
# 0x95af3e google.golang.org/grpc/internal/transport.(*Stream).waitOnHeader+0xfe /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/internal/transport/transport.go:241
# 0x95b02a google.golang.org/grpc/internal/transport.(*Stream).RecvCompress+0x2a /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/internal/transport/transport.go:252
# 0x97a74d google.golang.org/grpc.(*csAttempt).recvMsg+0x5cd /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/stream.go:748
# 0x97cf5c google.golang.org/grpc.(*clientStream).RecvMsg.func1+0x3c /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/stream.go:648
# 0x978dfd google.golang.org/grpc.(*clientStream).withRetry+0x7d /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/stream.go:534
# 0x9799ff google.golang.org/grpc.(*clientStream).RecvMsg+0x7f /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/grpc/stream.go:647
# 0xa07f91 google.golang.org/genproto/googleapis/pubsub/v1.(*subscriberStreamingPullClient).Recv+0x61 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/google.golang.org/genproto/googleapis/pubsub/v1/pubsub.pb.go:3086
# 0xa23d90 cloud.google.com/go/pubsub.(*pullStream).Recv.func1+0x70 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/pullstream.go:172
# 0xa1d84b cloud.google.com/go/pubsub.(*pullStream).call+0xcb /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/pullstream.go:132
# 0xa1daf8 cloud.google.com/go/pubsub.(*pullStream).Recv+0x58 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/pullstream.go:169
# 0xa1b4a2 cloud.google.com/go/pubsub.(*messageIterator).recvMessages+0x32 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/iterator.go:251
# 0xa1b1f1 cloud.google.com/go/pubsub.(*messageIterator).receive+0x601 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/iterator.go:194
# 0xa212d5 cloud.google.com/go/pubsub.(*Subscription).receive+0x3f5 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/subscription.go:545
# 0xa24086 cloud.google.com/go/pubsub.(*Subscription).Receive.func2+0x56 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/cloud.google.com/go/pubsub/subscription.go:489
# 0xa17936 golang.org/x/sync/errgroup.(*Group).Go.func1+0x56 /opt/TeamCityBuild/work/wombat-3/c23cc9760fd462e1/.deps/src/golang.org/x/sync/errgroup/errgroup.go:58
UPDATE: It does seem like one of the servers with the grpc-go change mentioned above did recover faster than the others. In the below image of jobs per host per minute, the yellow server is running the code with the grpc change (as well as 3 other servers) and the purple server is not.

@fastest963 Thanks for this latest goroutine dump. I'll look into it today.
In the meantime, for anyone wanting to resolve this immediately, please try pinning to google-cloud-go=v0.24.0 and grpc-go=1.13.0. I believe these two versions should not have any issues, though if you find they do please let us know.
We deployed those pinned versions to a couple servers but we are not noticing any change versus the latest versions for those repos. I'm starting to think its a server-side issue so I've opened a case. I'll update this once we get confirmation either way.

For the past week and a half i've been dealing with messages not getting pulled. All the messages get processed then consistently an hour later things start backing up, only to recover later on.
I even wrote a bare bones program to pull and immediately ack all messages. Same result. All the services that pull messages that have been running for months have been pulling messages just fine.
Not sure if this is what's being discussed here, figured i would chime in and offer anything i can to help.
Thanks @fastest963. I've brought this up with the backend team internally, too.
@albeebe Looks like similar symptoms, so this is the right place - thanks for the extra data point.
After downgrading to the versions mentioned above (and some server-side fixes) we aren't having the issue anymore. We'll wait for the grpc-go PR to be merged and then we'll test again.
Ah! That's great news. I'll inform the backend folks.
I can also confirm downgrading has eliminated the issue
thanks @jadekler
I will use pinned version as you mentioned, since v0.21 and v0.25 are not stable.
Hi all, a few gRPC bugs were fixed and verified internally over the last week incl https://github.com/grpc/grpc-go/issues/2340 and https://github.com/grpc/grpc-go/issues/2341.
Is anyone able to test against grpc-go=HEAD and google-cloud=go=HEAD? Our current theory is that this is a gRPC related issue, and I suspect that these bug fixes might fix the issue.
I'll also be creating a test subscriber later today to see if I see any problems.
We got the following deployed to a few servers and I'll update in the morning on the status:
google.golang.org/grpc (git) - c05280c cleanup: remove unused channel ready (#2353)
cloud.google.com/go (git) - 5f0ffe7 pubsub: Fix a possible deadlock in pstest
Thank you!
I deployed it to 2 servers (of 8 for this particular topic) at 1:48am UTC and at 2:45am UTC the 2 servers (light blue and yellow) started receiving less submissions. The rest of the non-upgraded servers dropped down a few minutes later at 2:47am. They stayed that way for a while, during which the backlog was filling up with millions of messages, and then the older servers jumped to consuming their max (MaxOutstanding is 1000 per server) at 3:20am but the 2 upgraded servers actually stayed at the low receiving rate until one (yellow) of them jumped up at 3:46am. The other (light blue) upgraded server never ended up recovering until it was restarted at 10:12am. The yellow upgraded server ran unto the same issue at 5:52am and it never recovered until 9:50am (before being restarted).

Given that both non-upgraded and upgraded servers were both affected. This leads me to believe that somehow the new code is causing our topic/subscription to be rate limited or capped in some way so it's affecting the older code as well as the newer code.
Just for kicks I deployed the new code to all 8 servers at 10:10am UTC and after a period of catching up with the backlog, it was working well and stabilized until 11:40am UTC when all of them went to a low receive rate (less than 1k a minute). In addition to the newer library I tweaked the MaxOutstanding to 750 and decreased the time it takes for us to ack each message. Previously the Pub/Sub team told us that we were causing memory issues on the forwarders when we held messages for too long so I thought that by decreasing these I could help prevent any issues, but those seem to not have worked.

At 12:07 UTC I deployed the old (v0.24.0) code and you can see above that it took until 12:14am UTC to actually jump up. The old code has been stable ever since. This is running now with the higher MaxOutstanding of 1000 and the slower ack time.
Let me know what I should do from here. This might be a separate issue since we were never in a state of getting 0 jobs, we just got into a state where we were receiving a very low amount of jobs. I can try using the latest HEAD of grpc-go and see if that makes any difference? Could there be a change in the google-cloud-go package that might be causing us to get server-side limited?
@fastest963 Thanks for that extremely detailed report, this is very valuable. Would you mind taking grpc-go=HEAD, as you suggested? It should help eliminate possible problems. You've given us a couple of leads to look into in the client library, which I'll begin today - thanks again.
@fastest963 Could you provide all your ReceiveSettings, as well as your subscription AckDeadline [1]? I'm seeing about 12,000 messages a second with a naive consumer, and around the same when I set MaxOutstandingMessages=1000 [2]. I've got an idea of why you may be becoming rate limited if you were using a custom ack deadline, but if that's not the case I'll have to dig more.
The yellow upgraded server ran unto the same issue at 5:52am and it never recovered until 9:50am (before being restarted).
By "the same issue", do you mean that Receive wasn't processing any messages, or that Receive was processing less messages than you had expected?
I'll leave these apps running and see if I see a change, or a failure as the original issue described.
1:
res, _ := sub.Config(ctx)
res.AckDeadline
2:

@jadekler Haven't had a chance to deploy the grpc-go=HEAD but its on my plate today.
The settings are
MaxOutstandingMessages: 1000
MaxExtension: 15 seconds
NumGoroutines: 40 (each server has 4 cores)
The ack deadline is 15 seconds. Between 11am UTC and 12pm our time-to-ack was 95th percentile 478ms and 50th percentile 275ms.
By "the same issue", do you mean that Receive wasn't processing any messages, or that Receive was processing less messages than you had expected?
It was still getting some messages just significantly lower than it should.
Also, we only get the issue after an hour or so of running them and them operating fine.
I don't think there's an issue if we changed the ack deadline to the default of 10 seconds. Do you think that'll help? Our jobs used to take longer so I think we had conservatively set it to 15 at some point but that doesn't seem to be necessary anymore.
It was still getting some messages just significantly lower than it should.
Ah, right on. In that case, I think this issue is now finished since the observed behavior has drifted away from the original issue. I'm opening https://github.com/GoogleCloudPlatform/google-cloud-go/issues/1177 to continue our discussion and investigation.
To anyone else - if you experienced this problem, please try upgrading to grpc-go=HEAD and google-cloud-go=HEAD. If you still see 0 messages received, let me know here and I'll re-open this issue. If you are seeing rate-limiting behavior, please let us know in https://github.com/GoogleCloudPlatform/google-cloud-go/issues/1177.
Correction: the bug appears to be in grpc. So, the advice for now is to use grpc-go=v1.13.0 and google-cloud-go=HEAD.