I have a production application which is working with google-cloud-pubsub 1.33.0.
The subscribing processes are deployed to AWS (ECS docker images, a long running server application) and everything is working as expected.
We have had a problem with a stuck message (https://console.cloud.google.com/support/cases/detail/17826998?project=symphony-gce-stage&folder&organizationId=307209726771) which could not be received with this version of the client. Moving to google-cloud-pubsub 1.56.0 allows the stuck message to be received, but unfortunately this does not work when deployed to AWS.
The application seems to hang on this line
The log message
About to get subscription with name projects/symphony-gce-dev/subscriptions/dev-s2smoke1-sym-ac6-us1-s2fwd-presenceV1ExchangeEnvelope
is emitted and then the process just hangs, presumably in the call
subscriptionAdminClient_.getSubscription(projectSubscriptionName);
A previous attempt to move to google-cloud-pubsub 1.51.0 was aborted because it did not work when deployed to AWS, but now we have a bug in 1.33.0 which we need to resolve so we have to move.
No other code changes were made, when we link with 1.33.0 the program runs locally and when deployed to AWS, when we change the maven dependency to 1.56.0 and rebuild the program runs normally from the debugger but hangs as described above.
I am unable to connect a debugger to the hanging process or to log onto the virtual server where it is running, all I have is the logs.
Something has obviously changed between 1.33.0 and 1.51.0 which does not work when deployed to AWS, do you have any idea what could cause this? Is there now a requirement for inbound network connectivity or anything like that?
I realise that this may be difficult to resolve and I appreciate whatever help you can give me.
Thanks for raising this. For some more context, do you know the state of your subscriber (running, failed, etc)? Does this hang on any particular message size?
@kolea2 There are two problems, the stuck message problem occurs on the old version of the pubsub client but not on the newer versions, whatever causes that has been fixed and we need to deploy a newer version of the pubsub client. Details about that problem are in the support case I linked above, the one and only stuck message was over 6 MB which is unusually large for out application, but that's not the point of this issue.
The second problem (this issue), is that with newer versions of the pubsub client the application does not start, it hangs on the first call to pubsub. I provided a link to the source code where it is failing above, in a call to SubscriptionAdminClient. getSubscription(ProjectSubscriptionName), there is no message involved.
In terms of context, we have an initialization container (docker image) which is executed at deploy time, it is responsible for creating topics and subscriptions. Our CICD process builds the code and deploys, the init container starts and it is validating that all the subscriptions exist. It is in the method deleteSubscription to remove an obsolete subscription, in fact this subscription has already been removed, but it is calling getSubscription to see if it does. If the call works a NotFoundException will be thrown and the process will move on.
If I run the process locally on my workstation it runs just fine, if I deploy it as a docker image to AWS it hangs. The exact same application code linked with the old version of the pubsub client runs just fine both locally and when deployed to AWS.
I think something has changed in the low level networking code inside the client which does not work in the AWS runtime environment.
Hi @bruceskingle, thanks for the additional information! For the second problem, it's possible you are running into a similar issue we recently resolved in the go client libraries (https://github.com/googleapis/gax-go/pull/73) where missing ca-certificates were causing hanging issues like you are seeing. Are you using slim/alpine/etc as your container base layer? Additionally, with debug logging on in your application, do you see any TLS or gRPC errors, such as "x509: certificate signed by unknown authority"? Other users who raised similar issues found adding ca-certificates on the image was able to fix this in the meantime. Adding @jadekler for any context I may be missing.
Additional information:
I'm not sure why this issue is only occurring when upgrading versions, but if it turns out to be a certificate problem we can track down where/when this changed more easily and we can possibly make a similar fix in the Java libraries. If not, we'll continue investigating.
I'm not sure what the base layer is but I expect it is Alpine, its an image someone else in our organization built. Upgrading to a later container is on my todo list so I will try that tomorrow and let you know if that makes any difference. If you can give me a reference to a base image you know works I can try to build form that as a means of confirming or eliminating this issue.
Alpine should be fine to use with ca-certificates added in one of the layers.
If it's helpful, @jadekler created a gist for a ruby dockerfile using ca-certificates: https://gist.github.com/jadekler/169cc4bf1ca236f9d380fd733847a8c8
Here is the debug level logging for the call, I see nothing related to certificates:
02:17:58.862 [main] INFO org.symphonyoss.s2.fugue.google.pubsub.GoogleSubscriberAdmin - About to delete subscription dev-s2smoke1-sym-ac6-us1-s2fwd-presenceV1ExchangeEnvelope on topic sym-ac6-dev-chat-glb-1-presenceV1ExchangeEnvelope
02:17:58.867 [main] INFO org.symphonyoss.s2.fugue.google.pubsub.GoogleSubscriberAdmin - About to get subscription with name projects/symphony-gce-dev/subscriptions/dev-s2smoke1-sym-ac6-us1-s2fwd-presenceV1ExchangeEnvelope
02:17:59.054 [grpc-default-executor-0] DEBUG io.grpc.netty.shaded.io.netty.channel.DefaultChannelId - -Dio.netty.processId: 10 (auto-detected)
02:17:59.058 [grpc-default-executor-0] DEBUG io.grpc.netty.shaded.io.netty.util.NetUtil - -Djava.net.preferIPv4Stack: false
02:17:59.058 [grpc-default-executor-0] DEBUG io.grpc.netty.shaded.io.netty.util.NetUtil - -Djava.net.preferIPv6Addresses: false
02:17:59.059 [grpc-default-executor-0] DEBUG io.grpc.netty.shaded.io.netty.util.NetUtil - Loopback interface: lo (lo, 127.0.0.1)
02:17:59.060 [grpc-default-executor-0] DEBUG io.grpc.netty.shaded.io.netty.util.NetUtil - /proc/sys/net/core/somaxconn: 128
02:17:59.061 [grpc-default-executor-0] DEBUG io.grpc.netty.shaded.io.netty.channel.DefaultChannelId - -Dio.netty.machineId: 02:42:ac:ff:fe:11:00:02 (auto-detected)
02:17:59.089 [grpc-default-worker-ELG-1-3] DEBUG io.grpc.netty.shaded.io.netty.util.ResourceLeakDetectorFactory - Loaded default ResourceLeakDetector: io.grpc.netty.shaded.io.netty.util.ResourceLeakDetector@3a72149e
02:17:59.159 [grpc-default-worker-ELG-1-3] DEBUG io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler - [id: 0x83b23b8e, L:/172.17.0.2:48570 - R:pubsub.googleapis.com/172.217.7.234:443] HANDSHAKEN: TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
02:17:59.165 [grpc-default-worker-ELG-1-3] DEBUG io.grpc.netty.shaded.io.grpc.netty.NettyClientHandler - [id: 0x83b23b8e, L:/172.17.0.2:48570 - R:pubsub.googleapis.com/172.217.7.234:443] OUTBOUND SETTINGS: ack=false settings={=0, =0, =1048576, =8192}
02:17:59.166 [grpc-default-worker-ELG-1-3] DEBUG io.grpc.netty.shaded.io.grpc.netty.NettyClientHandler - [id: 0x83b23b8e, L:/172.17.0.2:48570 - R:pubsub.googleapis.com/172.217.7.234:443] OUTBOUND WINDOW_UPDATE: streamId=0 windowSizeIncrement=983041
02:17:59.172 [grpc-default-worker-ELG-1-3] DEBUG io.grpc.netty.shaded.io.grpc.netty.NettyClientHandler - [id: 0x83b23b8e, L:/172.17.0.2:48570 - R:pubsub.googleapis.com/172.217.7.234:443] INBOUND SETTINGS: ack=false settings={=100, =1048576, =16384}
02:17:59.173 [grpc-default-worker-ELG-1-3] DEBUG io.grpc.netty.shaded.io.grpc.netty.NettyClientHandler - [id: 0x83b23b8e, L:/172.17.0.2:48570 - R:pubsub.googleapis.com/172.217.7.234:443] OUTBOUND SETTINGS: ack=true
02:17:59.176 [grpc-default-worker-ELG-1-3] DEBUG io.grpc.netty.shaded.io.grpc.netty.NettyClientHandler - [id: 0x83b23b8e, L:/172.17.0.2:48570 - R:pubsub.googleapis.com/172.217.7.234:443] INBOUND WINDOW_UPDATE: streamId=0 windowSizeIncrement=983041
02:17:59.177 [grpc-default-worker-ELG-1-3] DEBUG io.grpc.netty.shaded.io.grpc.netty.NettyClientHandler - [id: 0x83b23b8e, L:/172.17.0.2:48570 - R:pubsub.googleapis.com/172.217.7.234:443] INBOUND SETTINGS: ack=true
02:18:55.159 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:18:55.159 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:18:55.159 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:18:55.159 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:18:55.159 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:18:55.159 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:18:55.159 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:18:55.159 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:18:55.159 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:18:55.159 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:19:55.160 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:19:55.160 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.DefaultManagedHttpClientConnection - http-outgoing-2: Close connection
02:19:55.160 [java-sdk-http-connection-reaper] DEBUG com.amazonaws.internal.SdkSSLSocket - shutting down output of dynamodb.us-east-1.amazonaws.com/52.119.226.142:443
02:19:55.160 [java-sdk-http-connection-reaper] DEBUG com.amazonaws.internal.SdkSSLSocket - closing dynamodb.us-east-1.amazonaws.com/52.119.226.142:443
02:19:55.161 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:19:55.161 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:19:55.161 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.DefaultManagedHttpClientConnection - http-outgoing-1: Close connection
02:19:55.161 [java-sdk-http-connection-reaper] DEBUG com.amazonaws.internal.SdkSSLSocket - shutting down output of sts.amazonaws.com/52.46.134.192:443
02:19:55.161 [java-sdk-http-connection-reaper] DEBUG com.amazonaws.internal.SdkSSLSocket - closing sts.amazonaws.com/52.46.134.192:443
02:19:55.161 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:19:55.161 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.DefaultManagedHttpClientConnection - http-outgoing-7: Close connection
02:19:55.161 [java-sdk-http-connection-reaper] DEBUG com.amazonaws.internal.SdkSSLSocket - shutting down output of secretsmanager.us-east-1.amazonaws.com/34.198.55.103:443
02:19:55.161 [java-sdk-http-connection-reaper] DEBUG com.amazonaws.internal.SdkSSLSocket - closing secretsmanager.us-east-1.amazonaws.com/34.198.55.103:443
02:19:55.162 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:19:55.162 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:19:55.162 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.DefaultManagedHttpClientConnection - http-outgoing-3: Close connection
02:19:55.163 [java-sdk-http-connection-reaper] DEBUG com.amazonaws.internal.SdkSSLSocket - shutting down output of sym-s2-dev-s2smoke1-s2fwd-objects.s3.amazonaws.com/52.216.104.51:443
02:19:55.163 [java-sdk-http-connection-reaper] DEBUG com.amazonaws.internal.SdkSSLSocket - closing sym-s2-dev-s2smoke1-s2fwd-objects.s3.amazonaws.com/52.216.104.51:443
02:19:55.163 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:19:55.163 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:19:55.163 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.DefaultManagedHttpClientConnection - http-outgoing-5: Close connection
02:19:55.163 [java-sdk-http-connection-reaper] DEBUG com.amazonaws.internal.SdkSSLSocket - shutting down output of sym-s2-dev-s2smoke1-s2fwd-trace.s3.amazonaws.com/52.216.104.51:443
02:19:55.163 [java-sdk-http-connection-reaper] DEBUG com.amazonaws.internal.SdkSSLSocket - closing sym-s2-dev-s2smoke1-s2fwd-trace.s3.amazonaws.com/52.216.104.51:443
02:19:55.164 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:19:55.164 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.DefaultManagedHttpClientConnection - http-outgoing-4: Close connection
02:19:55.164 [java-sdk-http-connection-reaper] DEBUG com.amazonaws.internal.SdkSSLSocket - shutting down output of dynamodb.us-east-1.amazonaws.com/52.119.226.142:443
02:19:55.164 [java-sdk-http-connection-reaper] DEBUG com.amazonaws.internal.SdkSSLSocket - closing dynamodb.us-east-1.amazonaws.com/52.119.226.142:443
02:19:55.165 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:19:55.165 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.DefaultManagedHttpClientConnection - http-outgoing-6: Close connection
02:19:55.165 [java-sdk-http-connection-reaper] DEBUG com.amazonaws.internal.SdkSSLSocket - shutting down output of sns.us-east-1.amazonaws.com/52.94.231.16:443
02:19:55.165 [java-sdk-http-connection-reaper] DEBUG com.amazonaws.internal.SdkSSLSocket - closing sns.us-east-1.amazonaws.com/52.94.231.16:443
02:20:55.165 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:20:55.165 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:20:55.165 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:20:55.165 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:20:55.165 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:20:55.165 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:20:55.165 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:20:55.165 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:20:55.165 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:20:55.165 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:21:55.166 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:21:55.166 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:21:55.166 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:21:55.166 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:21:55.166 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:21:55.166 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:21:55.166 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:21:55.166 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:21:55.166 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:21:55.166 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:21:59.144 [grpc-default-worker-ELG-1-3] DEBUG io.grpc.netty.shaded.io.grpc.netty.NettyClientHandler - [id: 0x83b23b8e, L:/172.17.0.2:48570 - R:pubsub.googleapis.com/172.217.7.234:443] INBOUND GO_AWAY: lastStreamId=0 errorCode=0 length=17 bytes=73657373696f6e5f74696d65645f6f7574
02:22:55.166 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:22:55.166 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:22:55.166 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:22:55.166 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:22:55.166 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:22:55.166 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:22:55.167 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
02:22:55.167 [java-sdk-http-connection-reaper] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Closing connections idle longer than 60000 MILLISECONDS
Thanks for sending these over - I'll continue to look into this. To confirm it's not the issue, can you please also verify ca-certificates are added? Thanks!
I have been trying to do this but our image is based on ubuntu not alpine, can you tell me what ca certificate is actually required?
Of course Java has it's own trust anchor so the gist doesn't work for me "out of the box"....
I'm not sure that alpine is viable for this: https://github.com/infostellarinc/stellarstation-api/issues/66
I think I'm stuck with this at the moment.
@kolea2 I have found that the root cert for the pubsub api URL is indeed missing from the Java cacerts file in our docker base image, but adding it in made no difference, I suspect that the native netty implementation that GRCP uses is going to the OS ca-certificates file which does have the GlobalSign root cert in it. Given this and the fact that the debug log output says nothing about SSL or certificate errors I am persuaded that this is not being caused by a root certificate problem and I now have no idea what the actual problem is.
Please let me know what you think the problem might be.
@kolea2 I can now reproduce the problem with a locally deployed docker container and can attach a debugger.
My code is calling the pubsub sdk at
GoogleSubscriberManager.validateSubcription(TopicName, SubscriptionName) line: 153
which is
com.google.pubsub.v1.Subscription existing = subscriptionAdminClient.getSubscription(projectSubscriptionName);
Here is the full stack trace (after I manually suspended the thread), as you can see the thread is blocked in Unsafe.park as it is doing a get on a Future
Thread [main] (Suspended)
owns: GoogleSubscriberManager (id=8195)
Unsafe.park(boolean, long) line: not available [native method]
LockSupport.park(Object) line: 194
CallbackChainRetryingFuture<ResponseT>(AbstractFuture<V>).get() line: 497
Uninterruptibles.getUninterruptibly(Future<V>) line: 142
Futures.getUnchecked(Future<V>) line: 1558
ApiExceptions.callAndTranslateApiException(ApiFuture<ResponseT>) line: 53
UnaryCallable$1(UnaryCallable<RequestT,ResponseT>).call(RequestT) line: 112
SubscriptionAdminClient.getSubscription(GetSubscriptionRequest) line: 465
SubscriptionAdminClient.getSubscription(ProjectSubscriptionName) line: 418
GoogleSubscriberManager.validateSubcription(TopicName, SubscriptionName) line: 153
GoogleSubscriberManager.initSubscription(SubscriptionImpl<ImmutableByteArray>) line: 135
GoogleSubscriberManager(AbstractSubscriberManager<P,T>).start() line: 207
GoogleSubscriberManager(AbstractPullSubscriberManager<P,T>).start() line: 155
AwsFwd1(AbstractComponentContainer<T>).start() line: 220
AwsFwd1(FugueServer).start() line: 190
AwsFwd1.main(String[]) line: 54
The process has 2 AWS related threads doing something unrelated, Gax-1 to Gax-4 which look like this:
Daemon Thread [Gax-2] (Suspended)
Unsafe.park(boolean, long) line: not available [native method]
LockSupport.park(Object) line: 194
AbstractQueuedSynchronizer$ConditionObject.await() line: 2081
ScheduledThreadPoolExecutor$DelayedWorkQueue.take() line: 1177
ScheduledThreadPoolExecutor$DelayedWorkQueue.take() line: 899
ScheduledThreadPoolExecutor(ThreadPoolExecutor).getTask() line: 1054
ScheduledThreadPoolExecutor(ThreadPoolExecutor).runWorker(ThreadPoolExecutor$Worker) line: 1114
ThreadPoolExecutor$Worker.run() line: 628
Thread.run() line: 834
and grpc-default-worker-ELG-1-1 to grpc-default-worker-ELG-1-3 which all look like this:
Daemon Thread [grpc-default-worker-ELG-1-1] (Suspended)
owns: EPollSelectorImpl (id=8189)
owns: SelectedSelectionKeySet (id=8190)
EPoll.wait(int, long, int, int) line: not available [native method]
EPollSelectorImpl.doSelect(Consumer<SelectionKey>, long) line: 120
EPollSelectorImpl(SelectorImpl).lockAndDoSelect(Consumer<SelectionKey>, long) line: 124
EPollSelectorImpl(SelectorImpl).select(long) line: 136
SelectedSelectionKeySetSelector.select(long) line: 62
NioEventLoop.select(boolean) line: 753
NioEventLoop.run() line: 408
SingleThreadEventExecutor$5.run() line: 897
FastThreadLocalRunnable.run() line: 30
FastThreadLocalThread(Thread).run() line: 834
That's all the threads in the process, it seems to be permanently stuck.
This doesn't seem especially informative to me, does it give you any clues?
@bruceskingle can you add log level to FINE? Does it come out after 10 min?
It never returns, I will re-run with FINE logs
log.txt
@ajaaym here are the logs, the entry
23:31:59.527 [main] INFO org.symphonyoss.s2.fugue.google.pubsub.GoogleSubscriberManager - Validating topic sym-ac6-dev-chat-glb-1-internalMsg...
is from immediately before the call which hangs
@bruceskingle this seems java 9 related issue. Can you add this -Dio.netty.tryReflectionSetAccessible=true ? also add java --add-exports java.base/jdk.internal.misc=ALL-UNNAMED? More information over here
The hanging behaviour is the same with Java8, 10 and 11 and version 1.33.0 seems to work on all of those platforms too, but I am happy to add those options.
@bruceskingle thanks for this log. Is it possible to share docker file for both of this case?
@ajaaym @kolea2 I found the issue, it is to do with the version of grpc. For historic reasons we had a dependency management directive in a parent pom which was pulling in grpc 1.10.1 I have removed this and now it is working.
I have no idea why the problem manifests only when deployed to the docker container.
I have narrowed the problem down to version 1.37.0 of the sdk, i.e. 1.36.0 and all earlier versions I have tried seem to work as expected, but with 1.37.0 and all later versions I have tried the problem manifests, with the old GRPC library.
I am closing this issue as I am now unblocked, thanks for your help.