Azure-cosmos-dotnet-v2: NotFoundException: The read session is not available for the input session token

Created on 19 Mar 2018 · 30 Comments · Source: Azure/azure-cosmos-dotnet-v2

Running Microsoft.Azure.DocumentDB.Core 1.9.1 in Direct/TCP mode, there seems to be a race condition under load. If I run a unit test individually, everything is fine. But when I run all unit tests, about 13% of them fail in this fashion (copied from VSTS Build, edited for brevity, can't wait for readable async stack traces in Core 2.1):

  Assert.IsTrue failed. Error : Microsoft.Azure.Documents.NotFoundException: The read session is not available for the input session token.
 ActivityId: 42deed69-29d4-4e77-a00e-e5f4e77d6216, 
 ResponseTime: 2018-03-19T03:07:25.0547892Z, StoreReadResult: StorePhysicalAddress: rntbd://cdb-ms-prod-northcentralus1-fd1.documents.azure.com:14124/apps/8c91a418-c1cf-4beb-a4b8-73c0abb39bb4/services/f319454d-9d78-42bc-999d-f0d1e84e8cd1/partitions/6bf44e22-c705-4c3d-a2e4-f2fd08eac3e8/replicas/131654594162065569s/, LSN: 8, GlobalCommittedLsn: 8, PartitionKeyRangeId: 0, IsValid: True, StatusCode: 0, IsGone: False, IsNotFound: True, RequestCharge: 0, ItemLSN: -1, ResourceType: Document, OperationType: Read
 ResponseTime: 2018-03-19T03:07:25.0564466Z, StoreReadResult: StorePhysicalAddress: rntbd://cdb-ms-prod-northcentralus1-fd1.documents.azure.com:14072/apps/8c91a418-c1cf-4beb-a4b8-73c0abb39bb4/services/f319454d-9d78-42bc-999d-f0d1e84e8cd1/partitions/6bf44e22-c705-4c3d-a2e4-f2fd08eac3e8/replicas/131654594202446883s/, LSN: 8, GlobalCommittedLsn: 8, PartitionKeyRangeId: 0, IsValid: True, StatusCode: 0, IsGone: False, IsNotFound: True, RequestCharge: 0, ItemLSN: -1, ResourceType: Document, OperationType: Read
 ResponseTime: 2018-03-19T03:07:25.0581713Z, StoreReadResult: StorePhysicalAddress: rntbd://cdb-ms-prod-northcentralus1-fd1.documents.azure.com:16874/apps/8c91a418-c1cf-4beb-a4b8-73c0abb39bb4/services/f319454d-9d78-42bc-999d-f0d1e84e8cd1/partitions/6bf44e22-c705-4c3d-a2e4-f2fd08eac3e8/replicas/131657737047974371p/, LSN: 8, GlobalCommittedLsn: 8, PartitionKeyRangeId: 0, IsValid: True, StatusCode: 0, IsGone: False, IsNotFound: True, RequestCharge: 0, ItemLSN: -1, ResourceType: Document, OperationType: Read
 ResponseTime: 2018-03-19T03:07:25.0598054Z, StoreReadResult: StorePhysicalAddress: rntbd://cdb-ms-prod-northcentralus1-fd1.documents.azure.com:14046/apps/8c91a418-c1cf-4beb-a4b8-73c0abb39bb4/services/f319454d-9d78-42bc-999d-f0d1e84e8cd1/partitions/6bf44e22-c705-4c3d-a2e4-f2fd08eac3e8/replicas/131654594202446885s/, LSN: 8, GlobalCommittedLsn: 8, PartitionKeyRangeId: 0, IsValid: True, StatusCode: 0, IsGone: False, IsNotFound: True, RequestCharge: 0, ItemLSN: -1, ResourceType: Document, OperationType: Read
 Windows/10.0.14393 documentdb-netcore-sdk/1.9.1
    at Microsoft.Azure.Documents.ConsistencyReader.<ReadSessionAsync>d__14.MoveNext()
    at Microsoft.Azure.Documents.ReplicatedResourceClient.<InvokeAsync>d__19.MoveNext()
    at Microsoft.Azure.Documents.ReplicatedResourceClient.<>c__DisplayClass18_0.<<InvokeAsync>b__0>d.MoveNext()
    at Microsoft.Azure.Documents.BackoffRetryUtility`1.<>c__DisplayClass2_0`1.<<ExecuteAsync>b__0>d.MoveNext()

As you can see, it retries and fails. This only happens under load, which means I can't go into production with this NuGet package.

Another person reported the same issue on Feb 9, 2018: https://github.com/Azure/azure-documentdb-dotnet/issues/141#issuecomment-364489808

I can reproduce the problem consistently on a production Cosmos DB database in Visual Studio and in VSTS. Running the local emulator does not exhibit the problem.


RedFolder

👍6

Most helpful comment

@kirankumarkolli @arramac @ryancrawcour It's been over a month since I reported this issue. You have shipped a couple of new packages that still exhibit the problem. You're basically wasting your time releasing them because anyone who puts this code into production is in for a world of hurt.

I think I can speak for all of us on this issue when I say we are ready and willing to help you test any fixes for this. Please advise.

RedFolder on 25 Apr 2018

👍5

All 30 comments

@RonPeters we are investigating it and will get back to you.

kirankumarkolli on 23 Mar 2018

Any progress on this issue?

RedFolder on 4 Apr 2018

I've added more unit tests to my project that exercise Cosmos DB more, and now this problem happens even on the local emulator. I also tried every combination of Direct/Gateway and Tcp/Https, to no avail.

Microsoft.Azure.DocumentDB.Core 1.9.1 is obviously not ready for production use. This is really annoying. I guess I'll have to step back to previous versions and try to find one that actually works.

RedFolder on 8 Apr 2018

My current findings indicate that version 1.7.1 works, and versions >= 1.8.1 do not work. I'll be using 1.7.1 until this problem is fixed. Hopefully there are no blocking issues for me in that older release.

RedFolder on 8 Apr 2018

👍2

I have now upgraded a production .NET 4.7 project to Microsoft.Azure.DocumentDB v1.21.1, and I get the same problem there. I have had to downgrade back to Microsoft.Azure.DocumentDB v1.19.1 to keep the tests from failing.

So both the .NET and the .NET Core packages are broken.

RedFolder on 14 Apr 2018

Confirmed that .NET Core packages above 1.7.1 are not working with the local emulator. I have not tested it in production yet.
I tested it with a bunch of unit tests that re-create the same collection before every test case.

Please move this issue to the top of your priorities.

dominikfoldi on 16 Apr 2018

👍4

I can also confirm I am seeing the issue on version 1.9.1 -- this seems to happen when I'm making a lot of calls into my API, which results in concurrent executions against DocumentDB. I can also confirm that downgrading to 1.7.1 resolves the issue. I'll stay on 1.7.1 until this issue is resolved.

davidhjones on 18 Apr 2018

👍2

I am seeing this a lot now -- Using direct TCP mode, especially when calling CreateDocumentCollectionIfNotExistsAsync() then querying with CreateDocumentQuery(). Using latest 1.10 of Microsoft.Azure.DocumentDB.Core

I can confirm with 100% confidence that downgrading to 1.7.1 fixes the issue.

sonicmouse on 20 Apr 2018

@arramac @ryancrawcour can we get some insight here -- maybe some information about what caused this regression or what we can do to prevent this exception? Furthermore, is this on the radar of the team?

davidhjones on 20 Apr 2018

👍1

I can confirm that the problem is still present in both:
Microsoft.Azure.DocumentDB.Core 1.10.1
Microsoft.Azure.DocumentDB v1.22.0

RedFolder on 20 Apr 2018

I think I can speak for all of us on this issue when I say we are ready and willing to help you test any fixes for this. Please advise.

RedFolder on 25 Apr 2018

👍5

@RonPeters can you please email me at [email protected]? We'll debug and fix this asap. We were under the impression this is resolved, but we may have only addressed a similar but different bug.

arramac on 25 Apr 2018

👍1

@arramac Hi, we are experiencing the same issue (high load, intermittent exceptions with the same type/message). Did @RonPeters email you with any info so your team could take a look? We could possibly provide some info on this issue. This is high impact. Thanks!

ysnikitin on 3 May 2018

We wanted to upgrade dependencies soon, but it looks like we should wait for the fix.

ghost on 4 May 2018

If you're experiencing this issue, please use a single collection for the lifetime of the test (deleting all documents on TestCleanup). That way you can upgrade to the latest SDK without issues.

We suspect that this is due to operations immediately following delete -> recreate collection -> read documents. We were unable to get a simple repro (@RonPeters was finding it harder to repro). If anyone can share a repro, we will isolate and fix this.
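
A minimal sketch of the "single collection for the lifetime of the test" idea follows. It is written in Python only to stay self-contained; `FakeCollection` and its methods are hypothetical in-memory stand-ins, not SDK calls, and `tearDown` plays the role that TestCleanup plays in the suggestion above.

```python
import unittest


class FakeCollection:
    """Hypothetical in-memory stand-in for a Cosmos DB collection."""

    def __init__(self, name):
        self.name = name
        self.documents = {}

    def upsert(self, doc):
        self.documents[doc["id"]] = doc

    def delete_all_documents(self):
        # Against the real service this would be a query plus one delete per document.
        self.documents.clear()


class RepositoryTests(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Created once and reused by every test: no delete -> recreate cycle.
        cls.collection = FakeCollection("tests")

    def tearDown(self):
        # Per-test cleanup removes documents only; the collection itself survives.
        self.collection.delete_all_documents()

    def test_insert_one(self):
        self.collection.upsert({"id": "a"})
        self.assertEqual(len(self.collection.documents), 1)

    def test_starts_empty(self):
        self.assertEqual(len(self.collection.documents), 0)
```

The point of the shape is that the collection's identity never changes during the run, so the delete -> recreate -> read sequence suspected above never occurs.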

arramac on 4 May 2018

Sure, we don't have the "delete -> recreate collection" pattern in our scenarios, so once we have confirmation that this is the only case affected, we will upgrade.

Thank you for the details!

ghost on 4 May 2018

Okay everyone. I found it too difficult to create a consistent test case. The latest emulator behaves differently than the production Cosmos DB on Azure. And the latest NuGet packages behave better with the latest emulator than the older emulator.

Also, I suspect there is some kind of threading during test runs that interferes with the singleton nature of my dependencies, especially my Cosmos DB services. Deleting all the documents from the collection after every test run just exposed other problems.

So in the end, I changed how my tests work to be more compatible with the environment.

I now create a new database with a unique name (append a Guid to the name). The individual test runs on that database with its own fresh collection. That unique database is then deleted after the test.

This required me to reinitialize my dependency container at the beginning of each test so that it would generate a new set of singletons, but that was way easier than continuing to debug this issue.
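
The per-test database pattern described here can be sketched roughly as below. This is an illustration under stated assumptions, not the actual test code from this comment: `FakeCosmosAccount` and `IsolatedDatabase` are hypothetical names, the account is an in-memory stand-in, and Python is used only to keep the example self-contained.

```python
import uuid


class FakeCosmosAccount:
    """Hypothetical in-memory stand-in for a Cosmos DB account."""

    def __init__(self):
        self.databases = {}

    def create_database(self, name):
        self.databases[name] = {}

    def delete_database(self, name):
        del self.databases[name]


class IsolatedDatabase:
    """Context manager that gives each test its own uniquely named database."""

    def __init__(self, account, prefix="tests"):
        self.account = account
        # A fresh GUID suffix means no two tests ever share a database name.
        self.name = f"{prefix}-{uuid.uuid4()}"

    def __enter__(self):
        self.account.create_database(self.name)
        return self.name

    def __exit__(self, exc_type, exc, tb):
        # Always clean up, even if the test body raised.
        self.account.delete_database(self.name)
        return False


account = FakeCosmosAccount()
with IsolatedDatabase(account) as first:
    with IsolatedDatabase(account) as second:
        names_differ = first != second
        live_count = len(account.databases)
```

Because every name is new, no test ever re-creates a database or collection that a cached client has already seen, which is the situation the earlier comments associate with the error.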

So I wouldn't say this issue is closed, but this is my workaround. And it works locally and in VSTS CI.

RedFolder on 8 May 2018

👍1

This problem is still present in Microsoft.Azure.DocumentDB 2.1.2, and I don't know if I feel OK with trying to downgrade all the way to 1.7.1 ...
As far as I can tell it is some race condition: it happens intermittently when used inside async calls, and fairly consistently when I test inside a Parallel.ForEach() with a query that either returns a long result set or uses pagination.

GalmWing on 16 Oct 2018

@kirankumarkolli @arramac Is there an update to this issue?

I'm seeing this problem fairly consistently using Microsoft.Azure.DocumentDB 1.19.1, with the following usage pattern:

Using a single client application that executes the below logic.
Using a singleton DocumentClient instance.
Delete single collection (calling an API endpoint in our Swagger UI that calls Cosmos DB to delete the collection). This succeeds.
Sit and wait 30 seconds or so (no calls to Cosmos DB during this time).
Call another API endpoint (using Swagger UI) that internally triggers multiple parallel tasks that each try to:
- Call await(CreateDocumentCollectionIfNotExistsAsync(...)). This should recreate the collection that was deleted.
- Call await(UpsertDocumentAsync(...)) into the new collection.

We try to upsert ~30 documents as part of the code that fails; the failure is random in that sometimes the collection doesn't get created, sometimes it does. Sometimes (when the collection is created) a few of the documents also get created but then the code pretty consistently fails with the error:

Microsoft.Azure.Documents.DocumentClientException: The read session is not available for the input session token.
 ActivityId: 122670c0-c796-4996-ac78-0465e8526cfc, RequestStartTime: 2018-11-28T22:56:07.8070325Z, Number of regions attempted: 1
 ResponseTime: 2018-11-28T22:56:07.8070325Z, StoreReadResult: StorePhysicalAddress: rntbd://sn4prdapp19-docdb-1.documents.azure.com:14066/apps/c24ac90c-fff5-47b8-ab22-3d0ceb4f6dcf/services/e4b4525f-dabc-4bc5-97c7-b2865d416e47/partitions/b13d8032-2951-4cc7-b428-fa7f50513e61/replicas/131879152096479248p/, LSN: 1, GlobalCommittedLsn: 1, PartitionKeyRangeId: 0, IsValid: True, StatusCode: 404, IsGone: False, IsNotFound: True, IsInvalidPartition: False, RequestCharge: 0, ItemLSN: -1, SessionToken: 1, ResourceType: Collection, OperationType: Read
 ResponseTime: 2018-11-28T22:56:07.8070325Z, StoreReadResult: StorePhysicalAddress: rntbd://sn4prdapp19-docdb-1.documents.azure.com:14001/apps/c24ac90c-fff5-47b8-ab22-3d0ceb4f6dcf/services/e4b4525f-dabc-4bc5-97c7-b2865d416e47/partitions/b13d8032-2951-4cc7-b428-fa7f50513e61/replicas/131879152540243401s/, LSN: 1, GlobalCommittedLsn: 1, PartitionKeyRangeId: 0, IsValid: True, StatusCode: 404, IsGone: False, IsNotFound: True, IsInvalidPartition: False, RequestCharge: 0, ItemLSN: -1, SessionToken: 1, ResourceType: Collection, OperationType: Read
 ResponseTime: 2018-11-28T22:56:07.8070325Z, StoreReadResult: StorePhysicalAddress: rntbd://sn4prdapp19-docdb-1.documents.azure.com:16734/apps/c24ac90c-fff5-47b8-ab22-3d0ceb4f6dcf/services/e4b4525f-dabc-4bc5-97c7-b2865d416e47/partitions/b13d8032-2951-4cc7-b428-fa7f50513e61/replicas/131879152540243400s/, LSN: 1, GlobalCommittedLsn: 1, PartitionKeyRangeId: 0, IsValid: True, StatusCode: 404, IsGone: False, IsNotFound: True, IsInvalidPartition: False, RequestCharge: 0, ItemLSN: -1, SessionToken: 1, ResourceType: Collection, OperationType: Read
 ResponseTime: 2018-11-28T22:56:07.8170308Z, StoreReadResult: StorePhysicalAddress: rntbd://sn4prdapp19-docdb-1.documents.azure.com:14139/apps/c24ac90c-fff5-47b8-ab22-3d0ceb4f6dcf/services/e4b4525f-dabc-4bc5-97c7-b2865d416e47/partitions/b13d8032-2951-4cc7-b428-fa7f50513e61/replicas/131879152540243402s/, LSN: 1, GlobalCommittedLsn: 1, PartitionKeyRangeId: 0, IsValid: True, StatusCode: 404, IsGone: False, IsNotFound: True, IsInvalidPartition: False, RequestCharge: 0, ItemLSN: -1, SessionToken: 1, ResourceType: Collection, OperationType: Read,
 Microsoft.Azure.Documents.Common/2.1.0.0, documentdb-dotnet-sdk/1.19.1 Host/64-bit MicrosoftWindowsNT/10.0.16299.0
    at Microsoft.Azure.Documents.Client.ClientExtensions.<ParseResponseAsync>d__4.MoveNext()

The code that creates the documents in an empty collection works if I create the collection in the Azure portal by hand before I try to load the data - the ~30 documents get created _mostly_ successfully.

I say _mostly_ because I noticed that UpsertDocumentAsync(...) sometimes fails randomly on its own (less frequently). I think one error I've seen (I cannot find it in the logs now) was "Entity with the specified id does not exist in the system." This seems like another race condition somewhere in Cosmos DB, since I'd expect Upsert to create the item if it's not already in the collection.

gbpcor on 29 Nov 2018

Hi azure-cosmos-dotnet-v2 team. Have you had any luck fixing this?

We are seeing the issue occurring intermittently in Microsoft.Azure.DocumentDB.Core 2.1.3.
Notably, this seems to get triggered when all that's ever done through the DocumentClient instance is normal inserts/updates/deletes/lists, so no collection deleting.

Apart from the error message, the commonality with reports here is that our DocumentClient is a singleton
(but this is the usage recommended for performance by Microsoft - _"Use a singleton DocumentDB client for the lifetime of your application. Note that each DocumentClient instance is thread-safe"_)

slawomir-brzezinski-at-clarksons on 25 Jan 2019

I am also seeing this issue when calling into the Gremlin endpoint. The error response says that the version is documentdb-dotnet-sdk/2.2.2.

A sample error response:

ActivityId : 30d67961-e8c2-4c5d-bcea-8354bd3b4d46
ExceptionType : NotFoundException
ExceptionMessage :
Entity with the specified id does not exist in the system.
ActivityId: 30d67961-e8c2-4c5d-bcea-8354bd3b4d46, documentdb-dotnet-sdk/2.2.2 Host/64-bit MicrosoftWindowsNT/6.2.9200.0
Source : Microsoft.Azure.Documents.Client
Entity with the specified id does not exist in the system.
ActivityId: 30d67961-e8c2-4c5d-bcea-8354bd3b4d46, documentdb-dotnet-sdk/2.2.2 Host/64-bit MicrosoftWindowsNT/6.2.9200.0
BackendStatusCode : NotFound
BackendActivityId : 30d67961-e8c2-4c5d-bcea-8354bd3b4d46
HResult : 0x80131500

at Gremlin.Net.Driver.Messages.ResponseStatusExtensions.ThrowIfStatusIndicatesError(ResponseStatus status)
at Gremlin.Net.Driver.Connection.TryParseResponseMessage(ResponseMessage`1 receivedMsg)
at Gremlin.Net.Driver.Connection.Parse(Byte[] received)
--- End of stack trace from previous location where exception was thrown ---
at Gremlin.Net.Driver.ProxyConnection.SubmitAsync[T](RequestMessage requestMessage)
at Gremlin.Net.Driver.GremlinClient.SubmitAsync[T](RequestMessage requestMessage)
at Gremlin.Net.Driver.GremlinClientExtensions.SubmitAsync[T](IGremlinClient gremlinClient, String requestScript, Dictionary`2 bindings)
at AssocIO.Core.Repository.CosmosDbRepository.ExecuteQuery(String query, Dictionary`2 parameters) in C:devAssocIO.APIAssocIO.CoreRepositoryCosmosDbRepository.cs:line 93

I don't know if it is actually using this code base, but it seems to be the same issue.
Could you update us on when a fix is expected?

MikeHook on 25 Mar 2019

When I create around 1000+ edges through the Cosmos DB Gremlin API, I get the error below. Does anyone have a suggestion, please?

ActivityId : 36954616-f5be-403d-bb90-8fd5c3f67dd3
ExceptionType : NotFoundException
ExceptionMessage :
Message: {"Errors":["The read\/write session is not available."]}
ActivityId: 57fc0891-6cc6-4dbc-b31e-cdc61b0d4ba3, Request URI: /apps/6f477c7d-32cf-4c41-b646-2079af54f70a/services/f33e9d02-a6cd-4e41-9744-9bf49bfb5ef3/partitions/4215525b-560a-46d6-8b93-7144d7a4101a/replicas/131983338656965687p/, RequestStats: RequestStartTime: 2019-04-19T10:42:44.0404478Z, Number of regions attempted: 1 , SDK: documentdb-dotnet-sdk/2.2.2 Host/64-bit MicrosoftWindowsNT/6.2.9200.0
Source : Microsoft.Azure.Documents.Client
Message: {"Errors":["The read\/write session is not available."]}
ActivityId: 57fc0891-6cc6-4dbc-b31e-cdc61b0d4ba3, Request URI: /apps/6f477c7d-32cf-4c41-b646-2079af54f70a/services/f33e9d02-a6cd-4e41-9744-9bf49bfb5ef3/partitions/4215525b-560a-46d6-8b93-7144d7a4101a/replicas/131983338656965687p/, RequestStats: RequestStartTime: 2019-04-19T10:42:44.0404478Z, Number of regions attempted: 1 , SDK: documentdb-dotnet-sdk/2.2.2 Host/64-bit MicrosoftWindowsNT/6.2.9200.0
BackendStatusCode : NotFound
BackendActivityId : 57fc0891-6cc6-4dbc-b31e-cdc61b0d4ba3
HResult : 0x80131500

bhaumik-gandhi on 19 Apr 2019

We've run into this issue as well--perhaps some context will help track down the problem:

We encountered this error while testing an Azure Function App, which uses an output binding to write to a CosmosDB instance. The failure occurs when running a suite of integration tests which are creating a new database and container with each run. We have never seen this error in any of our production workflows for the same code.

We still saw the same behavior even at version 1.7.1 of Microsoft.Azure.DocumentDB, but we're only using that library to set up the database; we're not sure what the Function binding itself is using. The failure was intermittent--the test suite would pass on the first run, but generally fail on the second and subsequent runs.

Restarting our local Azure Function host between test runs seems to have resolved the problem for us.

lobsteropteryx on 10 Jul 2019

I'm seeing these issues as well, consistently after I delete and recreate a collection with the same name. I am using the Gremlin endpoint.

Interestingly, the error inside the stack trace uses the wording "Indicate an error" - and the query has successfully updated data as far as I can tell.

$exception._message: 
ServerError:

ActivityId : 9b73d734-f436-41bd-a836-858bed370318
ExceptionType : NotFoundException
ExceptionMessage :
    The read session is not available for the input session token.
    ActivityId: 0263231f-2a59-47c5-84e2-8b24f26bd07a,
    RequestStartTime: 2019-07-17T17:18:15.1543739Z, RequestEndTime: 2019-07-17T17:18:15.2012662Z, Number of regions attempted: 1
    ResponseTime: 2019-07-17T17:18:15.1543739Z, ///, documentdb-dotnet-sdk/2.4.0 Host/64-bit MicrosoftWindowsNT/6.2.9200.0

Stack trace:

   at Gremlin.Net.Driver.Messages.ResponseStatusExtensions.ThrowIfStatusIndicatesError(ResponseStatus status) in /home/smallette/git/apache/tinkerpop/gremlin-dotnet/src/Gremlin.Net/Driver/Messages/ResponseStatus.cs:line 47
   at Gremlin.Net.Driver.Connection.TryParseResponseMessage(ResponseMessage`1 receivedMsg) in /home/smallette/git/apache/tinkerpop/gremlin-dotnet/src/Gremlin.Net/Driver/Connection.cs:line 137
   at Gremlin.Net.Driver.Connection.Parse(Byte[] received) in /home/smallette/git/apache/tinkerpop/gremlin-dotnet/src/Gremlin.Net/Driver/Connection.cs:line 123
--- End of stack trace from previous location where exception was thrown ---
   at Gremlin.Net.Driver.ProxyConnection.SubmitAsync[T](RequestMessage requestMessage)
   at Gremlin.Net.Driver.GremlinClient.SubmitAsync[T](RequestMessage requestMessage)
   at Gremlin.Net.Driver.GremlinClientExtensions.SubmitAsync[T](IGremlinClient gremlinClient, String requestScript, Dictionary`2 bindings)

anders-lundgren on 17 Jul 2019

Also facing this when recreating Gremlin collections with the same name. We're using the Node gremlin client, though.

hanvyj on 13 Sep 2019

@hanvyj are you doing queries? If so can you try wrapping it in some retry logic to run the query again if it hits this failure?
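
A rough sketch of what such retry logic could look like is below. It is in Python only to stay self-contained, and everything in it is an assumption for illustration: `ReadSessionNotAvailable` is a hypothetical exception standing in for the NotFoundException seen in this thread, and `with_retries` and `flaky_query` are made-up names, not SDK or Gremlin client APIs.

```python
import time


class ReadSessionNotAvailable(Exception):
    """Hypothetical exception standing in for the NotFoundException in this thread."""


def with_retries(operation, attempts=5, base_delay=0.05, sleep=time.sleep):
    """Run operation(), retrying on ReadSessionNotAvailable with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except ReadSessionNotAvailable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * (2 ** attempt))


calls = {"count": 0}


def flaky_query():
    # Simulated query: fails twice, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ReadSessionNotAvailable(
            "The read session is not available for the input session token."
        )
    return ["doc-1", "doc-2"]


delays = []  # record the backoff delays instead of actually sleeping
result = with_retries(flaky_query, sleep=delays.append)
```

Retrying only helps if the failure is transient; the reply that follows reports a case where it was not.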

j82w on 13 Sep 2019

It didn't seem to work at all, regardless of retries.

I deleted and created it again and it failed a handful of times and it's been fine since. Maybe it would have started working if I'd kept retrying the first time!

hanvyj on 13 Sep 2019

Are there any updates here? We have been seeing this issue as well, with versions as recent as 2.6.0 of the Microsoft.Azure.DocumentDB package. Any suggestions on how to get around this problem would be welcome.

mastaehely on 23 Mar 2020

If you are facing this issue on a 2.x version, please make sure you upgrade to the latest version, listed here:
https://www.nuget.org/packages/Microsoft.Azure.DocumentDB/

In the comments above, others suggest downgrading the SDK. That suggestion does not come from the Cosmos DB team, and we believe it is not the correct fix for the issue.

I am closing this issue here. If you see the "The read session is not available" failure message, please make sure you move to the newest version of the SDK, and if you hit the issue again, please open a new GitHub issue. I am retiring this issue to avoid confusion around the suggestion to downgrade.