Azure-sdk-for-java: [BUG] Cosmos SDK seems to hang

Created on 15 Sep 2020 · 10 Comments · Source: Azure/azure-sdk-for-java

We are seeing an issue in production where Cosmos SDK v4 seems to hang, i.e., our API requests fire a query and then wait indefinitely. The issue is resolved only after the app goes through a restart. We have 3 different function apps that connect to a common Cosmos DB instance. The issue seems to occur around 1:30 PM PST about once a week, and when it occurs, it happens on all 3 function apps, which are hosted separately. The Functions runtime detects multiple calls taking a long time because of the wait and recycles the app within a few minutes as an auto-heal mechanism, so I am not able to take any thread dumps.

Inspecting the logs, we mostly see these 2 exception traces at the time of the issue:

io.netty.channel.ConnectTimeoutException
connection timed out: cdb-ms-prod-westus1-fd44.documents.azure.com/40.112.241.44:14325

Parsed Stacktrace:
[
{"method": "io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run","level": 0,"line": 267,"fileName": "AbstractNioChannel.java"},
{"method": "io.netty.util.concurrent.PromiseTask$RunnableAdapter.call","level": 1,"line": 38,"fileName": "PromiseTask.java"},
{"method": "io.netty.util.concurrent.ScheduledFutureTask.run","level": 2,"line": 127,"fileName": "ScheduledFutureTask.java"},
{"method": "io.netty.util.concurrent.AbstractEventExecutor.safeExecute","level": 3,"line": 163,"fileName": "AbstractEventExecutor.java"},
{"method": "io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks","level": 4,"line": 404,"fileName": "SingleThreadEventExecutor.java"},
{"method": "io.netty.channel.nio.NioEventLoop.run","level": 5,"line": 462,"fileName": "NioEventLoop.java"},
{"method": "io.netty.util.concurrent.SingleThreadEventExecutor$5.run","level": 6,"line": 905,"fileName": "SingleThreadEventExecutor.java"},
{"method": "io.netty.util.concurrent.FastThreadLocalRunnable.run","level": 7,"line": 30,"fileName": "FastThreadLocalRunnable.java"},
{"method": "java.lang.Thread.run","level": 8,"line": 748,"fileName": "Thread.java"}
]

And/Or

RntbdServiceEndpoint({"id":1,"isClosed":true,"concurrentRequests":0,"remoteAddress":"cdb-ms-prod-westus1-fd44.documents.azure.com:14325","channelPool":{"remoteAddress":"cdb-ms-prod-westus1-fd44.documents.azure.com:14325","isClosed":true,"configuration":{"maxChannels":130,"maxRequestsPerChannel":30,"idleConnectionTimeout":0,"readDelayLimit":65000000000,"writeDelayLimit":10000000000},"state":{"channelsAcquired":0,"channelsAvailable":0,"requestQueueLength":0}}}) is closed

Steps to reproduce the behavior: Unfortunately, I am not able to reproduce this issue at all on non-prod.

Version of the SDK currently used: 4.3.0

I wanted to check if the team can suggest some options to replicate this issue on non-prod and/or provide some direction to help resolve it.

Client Cosmos customer-reported needs-author-feedback needs-team-attention question tenet-reliability

All 10 comments

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @Wmengmsft, @MehaKaushik, @shurd, @anfeldma-ms

@kushagraThapar can you follow up with some additional debugging (Tracing) instructions and next steps?

@rkganji - Can you please provide some diagnostics logs?
You can get them by calling the getDiagnostics() API on any of the response types and logging the result.
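
As a rough illustration of that logging pattern (not SDK code): the sketch below wraps a call, extracts diagnostics from the response, and logs them. `DiagnosticsLogging`, `callAndLog`, and `FakeResponse` are hypothetical names; `FakeResponse` is only a stand-in so the example compiles without the SDK. With the real SDK you would pass the response type's `getDiagnostics()` accessor instead.

```java
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.logging.Logger;

// Hypothetical helper, not part of the Cosmos SDK.
final class DiagnosticsLogging {
    private static final Logger LOG = Logger.getLogger(DiagnosticsLogging.class.getName());

    // Runs the call, logs whatever diagnosticsOf extracts from the response,
    // and returns the response unchanged.
    static <R> R callAndLog(String operation, Supplier<R> call, Function<R, ?> diagnosticsOf) {
        R response = call.get();
        LOG.info(() -> operation + " diagnostics: " + diagnosticsOf.apply(response));
        return response;
    }

    // Hypothetical stand-in for an SDK response type that exposes getDiagnostics().
    static final class FakeResponse {
        String getDiagnostics() {
            return "<diagnostics placeholder>";
        }
    }

    public static void main(String[] args) {
        // Logs the placeholder diagnostics text for the fake response.
        callAndLog("readItem", FakeResponse::new, FakeResponse::getDiagnostics);
    }
}
```

Note that this only covers calls that actually return a response; a request that never completes produces nothing to log.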

Time at which the connection timeout was reported (PST): 9/7/2020 1:20:38 PM
I have attached the Cosmos diagnostic logs from 1:10 PM to 1:20 PM.
A portion of the overall request traffic went into a hung state after the connection timeout was reported.

I don't have diagnostics logs for the actual failing requests, since we don't get a response back at all for those.
query_data.csv.zip
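
Since the hung requests never return a response, one possible client-side mitigation is to bound the wait so that a stuck call fails with a timeout that can be logged. A minimal sketch, assuming the query result is available as a `java.util.concurrent.Future`; `QueryTimeoutGuard` and `awaitOrFail` are hypothetical names and not part of the Cosmos SDK:

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of the Cosmos SDK.
final class QueryTimeoutGuard {

    // Waits at most `limit` for the pending call. On timeout, cancels the
    // future and rethrows so the caller can log the failure or retry.
    static <T> T awaitOrFail(Future<T> pending, Duration limit)
            throws TimeoutException, ExecutionException, InterruptedException {
        try {
            return pending.get(limit.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            pending.cancel(true); // stop waiting on a call that may never return
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        // A future that is never completed stands in for a hung query.
        CompletableFuture<String> hung = new CompletableFuture<>();
        try {
            awaitOrFail(hung, Duration.ofMillis(200));
        } catch (TimeoutException e) {
            System.out.println("query timed out instead of waiting indefinitely");
        }
    }
}
```

This does not address the underlying connection problem; it only turns an indefinite wait into a visible error. If the SDK's async API is used, Reactor's `timeout` operator may serve a similar purpose.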

@rkganji - thanks, I will look at the logs.

@rkganji We released a new version yesterday related to this issue: v4.5.0. Can you please upgrade and test?
Here is the changelog: https://github.com/Azure/azure-sdk-for-java/blob/master/sdk/cosmos/azure-cosmos/CHANGELOG.md#450-2020-09-16

@kushagraThapar I don't see v4.5.0 here yet: https://mvnrepository.com/artifact/com.azure/azure-cosmos

Maven takes time to update the central index; you can find it here: https://repo1.maven.org/maven2/com/azure/azure-cosmos/

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

I confirm that the SDK recovers fine after this version upgrade when intermittent connection issues occur.
