Running Microsoft.Azure.DocumentDB.Core 1.9.1 in Direct/TCP mode, there seems to be a race condition under load. If I run a unit test individually, everything is fine. But when I run all unit tests, about 13% of them fail in this fashion (copied from VSTS Build, edited for brevity, can't wait for readable async stack traces in Core 2.1):
Assert.IsTrue failed. Error : Microsoft.Azure.Documents.NotFoundException: The read session is not available for the input session token.
ActivityId: 42deed69-29d4-4e77-a00e-e5f4e77d6216,
ResponseTime: 2018-03-19T03:07:25.0547892Z, StoreReadResult: StorePhysicalAddress: rntbd://cdb-ms-prod-northcentralus1-fd1.documents.azure.com:14124/apps/8c91a418-c1cf-4beb-a4b8-73c0abb39bb4/services/f319454d-9d78-42bc-999d-f0d1e84e8cd1/partitions/6bf44e22-c705-4c3d-a2e4-f2fd08eac3e8/replicas/131654594162065569s/, LSN: 8, GlobalCommittedLsn: 8, PartitionKeyRangeId: 0, IsValid: True, StatusCode: 0, IsGone: False, IsNotFound: True, RequestCharge: 0, ItemLSN: -1, ResourceType: Document, OperationType: Read
ResponseTime: 2018-03-19T03:07:25.0564466Z, StoreReadResult: StorePhysicalAddress: rntbd://cdb-ms-prod-northcentralus1-fd1.documents.azure.com:14072/apps/8c91a418-c1cf-4beb-a4b8-73c0abb39bb4/services/f319454d-9d78-42bc-999d-f0d1e84e8cd1/partitions/6bf44e22-c705-4c3d-a2e4-f2fd08eac3e8/replicas/131654594202446883s/, LSN: 8, GlobalCommittedLsn: 8, PartitionKeyRangeId: 0, IsValid: True, StatusCode: 0, IsGone: False, IsNotFound: True, RequestCharge: 0, ItemLSN: -1, ResourceType: Document, OperationType: Read
ResponseTime: 2018-03-19T03:07:25.0581713Z, StoreReadResult: StorePhysicalAddress: rntbd://cdb-ms-prod-northcentralus1-fd1.documents.azure.com:16874/apps/8c91a418-c1cf-4beb-a4b8-73c0abb39bb4/services/f319454d-9d78-42bc-999d-f0d1e84e8cd1/partitions/6bf44e22-c705-4c3d-a2e4-f2fd08eac3e8/replicas/131657737047974371p/, LSN: 8, GlobalCommittedLsn: 8, PartitionKeyRangeId: 0, IsValid: True, StatusCode: 0, IsGone: False, IsNotFound: True, RequestCharge: 0, ItemLSN: -1, ResourceType: Document, OperationType: Read
ResponseTime: 2018-03-19T03:07:25.0598054Z, StoreReadResult: StorePhysicalAddress: rntbd://cdb-ms-prod-northcentralus1-fd1.documents.azure.com:14046/apps/8c91a418-c1cf-4beb-a4b8-73c0abb39bb4/services/f319454d-9d78-42bc-999d-f0d1e84e8cd1/partitions/6bf44e22-c705-4c3d-a2e4-f2fd08eac3e8/replicas/131654594202446885s/, LSN: 8, GlobalCommittedLsn: 8, PartitionKeyRangeId: 0, IsValid: True, StatusCode: 0, IsGone: False, IsNotFound: True, RequestCharge: 0, ItemLSN: -1, ResourceType: Document, OperationType: Read
Windows/10.0.14393 documentdb-netcore-sdk/1.9.1
at Microsoft.Azure.Documents.ConsistencyReader.<ReadSessionAsync>d__14.MoveNext()
at Microsoft.Azure.Documents.ReplicatedResourceClient.<InvokeAsync>d__19.MoveNext()
at Microsoft.Azure.Documents.ReplicatedResourceClient.<>c__DisplayClass18_0.<<InvokeAsync>b__0>d.MoveNext()
at Microsoft.Azure.Documents.BackoffRetryUtility`1.<>c__DisplayClass2_0`1.<<ExecuteAsync>b__0>d.MoveNext()
As you can see, it retries and fails. This only happens under load, which means I can't go into production with this NuGet package.
Another person reported the same issue on Feb 9, 2018: https://github.com/Azure/azure-documentdb-dotnet/issues/141#issuecomment-364489808
I can reproduce the problem consistently on a production Cosmos DB database in Visual Studio and in VSTS. Running the local emulator does not exhibit the problem.
@RonPeters we are investigating it and will get back to you.
Any progress on this issue?
I've added more unit tests to my project that exercise Cosmos DB more, and now this problem happens even on the local emulator. I also tried every combination of Direct/Gateway and Tcp/Https, to no avail.
Microsoft.Azure.DocumentDB.Core 1.9.1 is obviously not ready for production use. This is really annoying. I guess I'll have to step back to previous versions and try to find one that actually works.
My current findings indicate that version 1.7.1 works, and versions >= 1.8.1 do not work. I'll be using 1.7.1 until this problem is fixed. Hopefully there are no blocking issues for me in that older release.
I've now upgraded a production .NET 4.7 project to Microsoft.Azure.DocumentDB v1.21.1 and I get the same problem there. I had to downgrade back to Microsoft.Azure.DocumentDB v1.19.1 to keep the tests from failing.
So both the .NET and the .NET Core packages are broken.
Confirmed that .NET Core packages above 1.7.1 are not working with the local emulator. I have not tested it in production yet.
I tested it with a bunch of unit tests that re-create the same collection before every test case.
Please make this issue a top priority.
I can also confirm I am seeing the issue on version 1.9.1. It seems to happen when I'm making a lot of calls into my API, which results in concurrent executions against DocumentDB. I can also confirm that downgrading to 1.7.1 resolves the issue. I'll stay on 1.7.1 until this issue is resolved.
I am seeing this a lot now, using Direct/TCP mode, especially when calling CreateDocumentCollectionIfNotExistsAsync() and then querying with CreateDocumentQuery(). I'm on the latest 1.10 release of Microsoft.Azure.DocumentDB.Core.
I can confirm with 100% confidence that downgrading to 1.7.1 fixes the issue.
@arramac @ryancrawcour can we get some insight here -- maybe some information about what caused this regression or what we can do to prevent this exception? Furthermore, is this on the radar of the team?
I can confirm that the problem is still present in both:
Microsoft.Azure.DocumentDB.Core 1.10.1
Microsoft.Azure.DocumentDB v1.22.0
@kirankumarkolli @arramac @ryancrawcour It's been over a month since I reported this issue. You have shipped a couple of new packages that still exhibit the problem. You're basically wasting your time releasing them because anyone who puts this code into production is in for a world of hurt.
I think I can speak for all of us on this issue when I say we are ready and willing to help you test any fixes for this. Please advise.
@RonPeters can you please email me at [email protected]? We'll debug and fix this asap. We were under the impression this is resolved, but we may have only addressed a similar but different bug.
@arramac Hi, we are experiencing the same issue (high load, intermittent exceptions with the same type/message). Did @RonPeters email you with any info so your team could take a look? We could possibly provide some info on this issue. This is high impact. Thanks!
We wanted to upgrade dependencies soon, but it looks like we should wait for the fix.
If you're experiencing this issue, please use a single collection for the lifetime of the test (deleting all documents on TestCleanup). That way you can upgrade to the latest SDK without issues.
We suspect that this is due to operations immediately following delete -> recreate collection -> read documents. We were unable to get a simple repro (@RonPeters was finding it harder to repro). If anyone can share a repro, we will isolate and fix this.
Sure. We don't have the "delete -> recreate collection" pattern in our scenarios, so once we have confirmation that this is the only case affected, we will upgrade.
Thank you for the details!
Okay everyone. I found it too difficult to create a consistent test case. The latest emulator behaves differently than the production Cosmos DB on Azure. And the latest NuGet packages behave better with the latest emulator than the older emulator.
Also, I suspect there is some kind of threading during test runs that interferes with the singleton nature of my dependencies, especially my Cosmos DB services. Deleting all the documents from the collection after every test run just exposed other problems.
So in the end, I changed how my tests work to be more compatible with the environment.
I now create a new database with a unique name (append a Guid to the name). The individual test runs on that database with its own fresh collection. That unique database is then deleted after the test.
This required me to reinitialize my dependency container at the beginning of each test so that it would generate a new set of singletons, but that was way easier than continuing to debug this issue.
So I wouldn't say this issue is closed, but this is my workaround. And it works locally and in VSTS CI.
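For anyone who wants to try the same workaround, here is a minimal sketch of the "unique database per test" idea described above. It is not code from my project: `UniqueDatabaseScope` and its delegate parameters are hypothetical names, and the helper is deliberately SDK-agnostic so it only shows the naming and cleanup pattern.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper (not part of any SDK): gives each test its own database
// name and remembers how to delete it, so no test reuses a recreated resource.
public sealed class UniqueDatabaseScope
{
    private readonly Func<string, Task> _deleteDatabase;

    // The generated, per-test database id.
    public string DatabaseId { get; }

    private UniqueDatabaseScope(string databaseId, Func<string, Task> deleteDatabase)
    {
        DatabaseId = databaseId;
        _deleteDatabase = deleteDatabase;
    }

    // createDatabase / deleteDatabase are supplied by the caller and receive the id.
    public static async Task<UniqueDatabaseScope> CreateAsync(
        string baseName,
        Func<string, Task> createDatabase,
        Func<string, Task> deleteDatabase)
    {
        // Append a Guid so names never collide across tests or runs.
        string id = $"{baseName}-{Guid.NewGuid():N}";
        await createDatabase(id);
        return new UniqueDatabaseScope(id, deleteDatabase);
    }

    // Call this from the test cleanup step.
    public Task DeleteAsync() => _deleteDatabase(DatabaseId);
}
```

In a test class you would call `CreateAsync` in the initialize step and `DeleteAsync` in the cleanup step. With the DocumentDB SDK, the two delegates could presumably wrap `CreateDatabaseIfNotExistsAsync` and `DeleteDatabaseAsync` on the `DocumentClient`; that wiring is my assumption and is not shown here.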
This problem is still present in Microsoft.Azure.DocumentDB 2.1.2. And I don't know if I feel ok with trying to downgrade all the way to 1.7.1 ...
As far as I can tell, it is some kind of race condition. It happens intermittently when used inside async calls, and fairly consistently when I test inside a Parallel.ForEach() with a query that either returns a long result set or uses pagination.
@kirankumarkolli @arramac Is there an update to this issue?
I'm seeing this problem fairly consistently using Microsoft.Azure.DocumentDB 1.19.1, with the following usage pattern:
await(CreateDocumentCollectionIfNotExistsAsync(...)). This should recreate the collection that was deleted.
await(UpsertDocumentAsync(...)) into the new collection.
We try to upsert ~30 documents as part of the code that fails; the failure is random in that sometimes the collection doesn't get created, sometimes it does. Sometimes (when the collection is created) a few of the documents also get created but then the code pretty consistently fails with the error:
Microsoft.Azure.Documents.DocumentClientException: The read session is not available for the input session token.
ActivityId: 122670c0-c796-4996-ac78-0465e8526cfc,
RequestStartTime: 2018-11-28T22:56:07.8070325Z, Number of regions attempted: 1
ResponseTime: 2018-11-28T22:56:07.8070325Z, StoreReadResult: StorePhysicalAddress: rntbd://sn4prdapp19-docdb-1.documents.azure.com:14066/apps/c24ac90c-fff5-47b8-ab22-3d0ceb4f6dcf/services/e4b4525f-dabc-4bc5-97c7-b2865d416e47/partitions/b13d8032-2951-4cc7-b428-fa7f50513e61/replicas/131879152096479248p/, LSN: 1, GlobalCommittedLsn: 1, PartitionKeyRangeId: 0, IsValid: True, StatusCode: 404, IsGone: False, IsNotFound: True, IsInvalidPartition: False, RequestCharge: 0, ItemLSN: -1, SessionToken: 1, ResourceType: Collection, OperationType: Read
ResponseTime: 2018-11-28T22:56:07.8070325Z, StoreReadResult: StorePhysicalAddress: rntbd://sn4prdapp19-docdb-1.documents.azure.com:14001/apps/c24ac90c-fff5-47b8-ab22-3d0ceb4f6dcf/services/e4b4525f-dabc-4bc5-97c7-b2865d416e47/partitions/b13d8032-2951-4cc7-b428-fa7f50513e61/replicas/131879152540243401s/, LSN: 1, GlobalCommittedLsn: 1, PartitionKeyRangeId: 0, IsValid: True, StatusCode: 404, IsGone: False, IsNotFound: True, IsInvalidPartition: False, RequestCharge: 0, ItemLSN: -1, SessionToken: 1, ResourceType: Collection, OperationType: Read
ResponseTime: 2018-11-28T22:56:07.8070325Z, StoreReadResult: StorePhysicalAddress: rntbd://sn4prdapp19-docdb-1.documents.azure.com:16734/apps/c24ac90c-fff5-47b8-ab22-3d0ceb4f6dcf/services/e4b4525f-dabc-4bc5-97c7-b2865d416e47/partitions/b13d8032-2951-4cc7-b428-fa7f50513e61/replicas/131879152540243400s/, LSN: 1, GlobalCommittedLsn: 1, PartitionKeyRangeId: 0, IsValid: True, StatusCode: 404, IsGone: False, IsNotFound: True, IsInvalidPartition: False, RequestCharge: 0, ItemLSN: -1, SessionToken: 1, ResourceType: Collection, OperationType: Read
ResponseTime: 2018-11-28T22:56:07.8170308Z, StoreReadResult: StorePhysicalAddress: rntbd://sn4prdapp19-docdb-1.documents.azure.com:14139/apps/c24ac90c-fff5-47b8-ab22-3d0ceb4f6dcf/services/e4b4525f-dabc-4bc5-97c7-b2865d416e47/partitions/b13d8032-2951-4cc7-b428-fa7f50513e61/replicas/131879152540243402s/, LSN: 1, GlobalCommittedLsn: 1, PartitionKeyRangeId: 0, IsValid: True, StatusCode: 404, IsGone: False, IsNotFound: True, IsInvalidPartition: False, RequestCharge: 0, ItemLSN: -1, SessionToken: 1, ResourceType: Collection, OperationType: Read
, Microsoft.Azure.Documents.Common/2.1.0.0, documentdb-dotnet-sdk/1.19.1 Host/64-bit MicrosoftWindowsNT/10.0.16299.0
at Microsoft.Azure.Documents.Client.ClientExtensions.<ParseResponseAsync>d__4.MoveNext()
The code that creates the documents in an empty collection works if I create the collection in the Azure portal by hand before I try to load the data - the ~30 documents get created _mostly_ successfully.
I say _mostly_ because I noticed that UpsertDocumentAsync(...) sometimes fails randomly on its own (less frequently). I think one error I've seen (cannot find it in the logs now) was Entity with the specified id does not exist in the system. This seems like another race condition somewhere in Cosmos DB since I'd expect Upsert to create the item if it's not already in the collection.
Hi azure-cosmos-dotnet-v2 team. Have you had any luck fixing this?
We are seeing the issue occurring intermittently in Microsoft.Azure.DocumentDB.Core 2.1.3.
Notably, this seems to get triggered when all that's ever done through the DocumentClient instance is normal inserts/updates/deletes/lists, so no collection deleting.
Apart from the error message, the commonality with reports here is that our DocumentClient is a singleton
(but this is the usage recommended for performance by Microsoft - _"Use a singleton DocumentDB client for the lifetime of your application Note that each DocumentClient instance is thread-safe"_)
I am also seeing this issue when calling into the Gremlin endpoint. The error response says that the version is documentdb-dotnet-sdk/2.2.2.
A sample error response:
ActivityId : 30d67961-e8c2-4c5d-bcea-8354bd3b4d46
ExceptionType : NotFoundException
ExceptionMessage :
Entity with the specified id does not exist in the system.
ActivityId: 30d67961-e8c2-4c5d-bcea-8354bd3b4d46, documentdb-dotnet-sdk/2.2.2 Host/64-bit MicrosoftWindowsNT/6.2.9200.0
Source : Microsoft.Azure.Documents.Client
Entity with the specified id does not exist in the system.
ActivityId: 30d67961-e8c2-4c5d-bcea-8354bd3b4d46, documentdb-dotnet-sdk/2.2.2 Host/64-bit MicrosoftWindowsNT/6.2.9200.0
BackendStatusCode : NotFound
BackendActivityId : 30d67961-e8c2-4c5d-bcea-8354bd3b4d46
HResult : 0x80131500
at Gremlin.Net.Driver.Messages.ResponseStatusExtensions.ThrowIfStatusIndicatesError(ResponseStatus status)
at Gremlin.Net.Driver.Connection.TryParseResponseMessage(ResponseMessage`1 receivedMsg)
at Gremlin.Net.Driver.Connection.Parse(Byte[] received)
--- End of stack trace from previous location where exception was thrown ---
at Gremlin.Net.Driver.ProxyConnection.SubmitAsync[T](RequestMessage requestMessage)
at Gremlin.Net.Driver.GremlinClient.SubmitAsync[T](RequestMessage requestMessage)
at Gremlin.Net.Driver.GremlinClientExtensions.SubmitAsync[T](IGremlinClient gremlinClient, String requestScript, Dictionary`2 bindings)
at AssocIO.Core.Repository.CosmosDbRepository.ExecuteQuery(String query, Dictionary`2 parameters) in C:devAssocIO.APIAssocIO.CoreRepositoryCosmosDbRepository.cs:line 93
I don't know if it is actually using this code base but it seems to be the same issue
Could you update us on when a fix is expected?
When I create around 1,000 or more edges through the Cosmos DB Gremlin API, I get the error below. Does anyone have a suggestion?
ActivityId : 36954616-f5be-403d-bb90-8fd5c3f67dd3
ExceptionType : NotFoundException
ExceptionMessage :
Message: {"Errors":["The read\/write session is not available."]}
ActivityId: 57fc0891-6cc6-4dbc-b31e-cdc61b0d4ba3, Request URI: /apps/6f477c7d-32cf-4c41-b646-2079af54f70a/services/f33e9d02-a6cd-4e41-9744-9bf49bfb5ef3/partitions/4215525b-560a-46d6-8b93-7144d7a4101a/replicas/131983338656965687p/, RequestStats:
RequestStartTime: 2019-04-19T10:42:44.0404478Z, Number of regions attempted: 1
, SDK: documentdb-dotnet-sdk/2.2.2 Host/64-bit MicrosoftWindowsNT/6.2.9200.0
Source : Microsoft.Azure.Documents.Client
Message: {"Errors":["The read\/write session is not available."]}
ActivityId: 57fc0891-6cc6-4dbc-b31e-cdc61b0d4ba3, Request URI: /apps/6f477c7d-32cf-4c41-b646-2079af54f70a/services/f33e9d02-a6cd-4e41-9744-9bf49bfb5ef3/partitions/4215525b-560a-46d6-8b93-7144d7a4101a/replicas/131983338656965687p/, RequestStats:
RequestStartTime: 2019-04-19T10:42:44.0404478Z, Number of regions attempted: 1
, SDK: documentdb-dotnet-sdk/2.2.2 Host/64-bit MicrosoftWindowsNT/6.2.9200.0
BackendStatusCode : NotFound
BackendActivityId : 57fc0891-6cc6-4dbc-b31e-cdc61b0d4ba3
HResult : 0x80131500
We've run into this issue as well--perhaps some context will help track down the problem:
We encountered this error while testing an Azure Function App, which uses an output binding to write to a CosmosDB instance. The failure occurs when running a suite of integration tests which are creating a new database and container with each run. We have never seen this error in any of our production workflows for the same code.
We still saw the same behavior even at version 1.7.1 of Microsoft.Azure.DocumentDB, but we're only using that library to set up the database; we're not sure what the Function binding itself uses. The failure was intermittent: the test suite would pass on the first run, but generally fail on the second and subsequent runs.
Restarting our local Azure Function host between test runs seems to have resolved the problem for us.
I'm seeing these issues as well, consistently after I delete and recreate a collection with the same name. I am using the Gremlin endpoint.
Interestingly, the error inside the stack trace uses the wording "Indicate an error" - and the query has successfully updated data as far as I can tell.
$exception._message:
ServerError: \r\n\nActivityId : 9b73d734-f436-41bd-a836-858bed370318\nExceptionType : NotFoundException\nExceptionMessage :\r\n\tThe read session is not available for the input session token.\r\n\tActivityId: 0263231f-2a59-47c5-84e2-8b24f26bd07a, \r\n\tRequestStartTime: 2019-07-17T17:18:15.1543739Z, RequestEndTime: 2019-07-17T17:18:15.2012662Z, Number of regions attempted: 1\r\n\tResponseTime: 2019-07-17T17:18:15.1543739Z, ///, documentdb-dotnet-sdk/2.4.0 Host/64-bit MicrosoftWindowsNT/6.2.9200.0
Stack trace:
at Gremlin.Net.Driver.Messages.ResponseStatusExtensions.ThrowIfStatusIndicatesError(ResponseStatus status) in /home/smallette/git/apache/tinkerpop/gremlin-dotnet/src/Gremlin.Net/Driver/Messages/ResponseStatus.cs:line 47\r\n at Gremlin.Net.Driver.Connection.TryParseResponseMessage(ResponseMessage`1 receivedMsg) in /home/smallette/git/apache/tinkerpop/gremlin-dotnet/src/Gremlin.Net/Driver/Connection.cs:line 137\r\n at Gremlin.Net.Driver.Connection.Parse(Byte[] received) in /home/smallette/git/apache/tinkerpop/gremlin-dotnet/src/Gremlin.Net/Driver/Connection.cs:line 123\r\n--- End of stack trace from previous location where exception was thrown ---\r\n at Gremlin.Net.Driver.ProxyConnection.SubmitAsync[T](RequestMessage requestMessage)\r\n at Gremlin.Net.Driver.GremlinClient.SubmitAsync[T](RequestMessage requestMessage)\r\n at Gremlin.Net.Driver.GremlinClientExtensions.SubmitAsync[T](IGremlinClient gremlinClient, String requestScript, Dictionary`2 bindings)\r\n
Also facing this when recreating Gremlin collections with the same name. We're using the Node Gremlin client, though.
@hanvyj are you doing queries? If so can you try wrapping it in some retry logic to run the query again if it hits this failure?
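A rough sketch of what such retry logic could look like in C# (the Node client would need the equivalent). `SessionRetry`, its parameters, and the message-matching rule are all hypothetical: they are not part of any SDK, and matching on the message text is only a stopgap assumption based on the error strings quoted in this thread.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper: retries an operation only when the failure message
// matches the "session is not available" errors quoted in this thread.
public static class SessionRetry
{
    public static async Task<T> ExecuteAsync<T>(
        Func<Task<T>> operation,
        int maxAttempts = 5,
        int initialDelayMs = 100)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        int delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            // On the last attempt the filter is false, so the exception propagates.
            catch (Exception ex) when (attempt < maxAttempts && IsSessionNotAvailable(ex))
            {
                await Task.Delay(delayMs);
                delayMs *= 2; // simple exponential backoff between attempts
            }
        }
    }

    // Assumption: a substring match covers both message variants seen above.
    public static bool IsSessionNotAvailable(Exception ex) =>
        ex.Message.IndexOf("session is not available", StringComparison.OrdinalIgnoreCase) >= 0;
}
```

Usage would be along the lines of `await SessionRetry.ExecuteAsync(() => RunQueryAsync())`, where `RunQueryAsync` stands in for whatever call is failing. Any other exception is rethrown immediately, so unrelated failures are not hidden.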
It didn't seem to work at all, regardless of retries.
I deleted and created it again and it failed a handful of times and it's been fine since. Maybe it would have started working if I'd kept retrying the first time!
Are there any updates here? We have been seeing this issue as well, with versions as recent as 2.6.0 for the Microsoft.Azure.DocumentDb package. Any suggestions on how to get around this problem would be welcome.
If you are facing this issue on a 2.x version, please make sure you upgrade to the latest version listed here:
https://www.nuget.org/packages/Microsoft.Azure.DocumentDB/
In the comments above, others have suggested downgrading the SDK. That suggestion does not come from the Cosmos DB team, and we believe it is not the correct fix for the issue.
I am closing this issue here. If you see the "The read session is not available" failure message, please make sure you move to the newest version of the SDK, and if you hit the issue again, please open a new GitHub issue. I am retiring this issue to avoid confusion around the suggestion to downgrade.
@mastaehely as @moderakh mentioned, please open a new Issue with your particular case including: