Describe the bug
I'm not sure if this is a bug or just a normal behaviour that needs to be handled by consumer. However the response we get is not documented anywhere (or I wasn't able to find it).
Basically we are getting no result with a continuation token in a situation that we are not expecting it.
To Reproduce
This is very hard to reproduce as the consumer does not control the shard (physical partition) distribution. But you need a comosdb that has a few logical partitions and at least two shards and your query should be formed aiming for the data in the second shared. Do not provide a partition key and make the query cross partition.
Expected behavior
When:
I'm expecting to receive the result in the first call.
Actual behavior
The token looks like below:
{"token":null,"range":{"min":"05C1DFFFFFFFFC","max":"FF"}}
Environment summary
SDK Version: 2.2.3
OS Version Windows (different versions)
Additional context
Following is the sample code that I can reproduce the issue every single time. In this case I have a document sitting in partition 2 (index 1) which I assume it's the second shard.
var client = new DocumentClient(ServiceEndpoint, AuthKey);
const string query = "select * from c where c.title='JACK CALLAGHAN'";
var collection = UriFactory.CreateDocumentCollectionUri(DatabaseId, CollectionId);
var cQuery = client.CreateDocumentQuery<dynamic>(collection, query, new FeedOptions
{
EnableCrossPartitionQuery = true,
PopulateQueryMetrics = true
}).AsDocumentQuery();
var response = cQuery.ExecuteNextAsync().GetAwaiter().GetResult();
Console.WriteLine($"response.AsEnumerable().Count()= {response.AsEnumerable().Count()}");
foreach (string headerKey in response.ResponseHeaders.Keys)
{
Console.WriteLine($"{headerKey}");
var keyValues = response.ResponseHeaders[headerKey].Split(";");
foreach (var keyValue in keyValues)
{
Console.WriteLine($"{keyValue}");
}
Console.WriteLine();
}
Following is the output with all the header keys:
response.AsEnumerable().Count()= 0
Cache-Control
no-store, no-cache
Pragma
no-cache
Transfer-Encoding
chunked
Server
Microsoft-HTTPAPI/2.0
Strict-Transport-Security
max-age=31536000
x-ms-last-state-change-utc
Wed, 03 Apr 2019 00:50:35.469 GMT
x-ms-resource-quota
documentSize=51200
documentsSize=52428800
documentsCount=-1
collectionSize=52428800
x-ms-resource-usage
documentSize=184
documentsSize=164076
documentsCount=94186
collectionSize=188910
lsn
118852
x-ms-item-count
0
x-ms-schemaversion
1.7
x-ms-alt-content-path
dbs/bettingedge/colls/fixtures
x-ms-content-path
S8sXAPPiCdc=
x-ms-xp-role
1
x-ms-documentdb-query-metrics
totalExecutionTimeInMs=0.27
queryCompileTimeInMs=0.04
queryLogicalPlanBuildTimeInMs=0.02
queryPhysicalPlanBuildTimeInMs=0.03
queryOptimizationTimeInMs=0.00
VMExecutionTimeInMs=0.06
indexLookupTimeInMs=0.05
documentLoadTimeInMs=0.00
systemFunctionExecuteTimeInMs=0.00
userFunctionExecuteTimeInMs=0.00
retrievedDocumentCount=0
retrievedDocumentSize=0
outputDocumentCount=0
outputDocumentSize=49
writeOutputTimeInMs=0.00
indexUtilizationRatio=0.00
x-ms-global-Committed-lsn
118851
x-ms-number-of-read-regions
0
x-ms-transport-request-id
12
x-ms-cosmos-llsn
118852
x-ms-session-token
0:-1#118852
x-ms-request-charge
2.86
x-ms-serviceversion
version=2.2.0.0
x-ms-activity-id
c4bc4b76-47c2-42e9-868a-9ecfe0936b1e
x-ms-continuation
{"token":null,"range":{"min":"05C1DFFFFFFFFC","max":"FF"}}
x-ms-gatewayversion
version=2.2.0.0
Date
Fri, 05 Apr 2019 05:40:21 GMT
Content-Type
application/json
As you can see the header clearly states that one document is returned but the response is empty.
If we continue the query with the composite continuation token we can see the result.
Is that a normal behavior or a bug?
Hi,
I experienced the same issue just the other week. After several weeks, this is as intended.
I wanted to know what happened "under the hood" and I was given the following answer from the CosmosDB Product team:
"If you鈥檙e curious about the implementation, we keep track of intra-partition and inter-partition progress. In this case, the null indicates that one partition is done. You can find more details in source code of the Java driver: https://github.com/Azure/azure-cosmosdb-java/blob/ae692357f9af0329361977f2ba0927b2d1764840/gateway/src/main/java/com/microsoft/azure/cosmosdb/rx/internal/query/ParallelDocumentQueryExecutionContext.java
But that said, we highly recommend treating the token as an opaque string, as the implementation may change under the hood. You should be able to echo and resume any query by doing this."
In other words, I suspect that your partition is searched by your client, and you get the token back. And you can't be entirely sure that there are no more documents, that's why you get the token and must continue scanning.
Also, where do you clearly see that one document is returned? I can see that the header x-ms-item-count indicates zero (0) and the documentsCount is minus one (-1) ?
Hope it helps!
@luddskunk thanks for your reply.
I believe I misread the data. Sorry about that. As you said the document count is (-1). There is however a documentSize=184.
Anyway, at the moment we are echoing back and continue the query.
The issues are:
@aboo Hi again,
Regarding your issues we have faced the same situations.
I have some comments.
we didn't know about this behavior and our client received an empty list with a continuation token and pretty much broke our flow.
Now in server side we are handling this situation and continue until we get result. The issue is what if there are 1000 partitions. Do we have to continue 100 times? Is that how ComosDB protect their SLA and under 10 ms response time. $ThinkingLoud
Yes, this broke our flow the first time too. But we handle it on the server side as well (Together with MaxNumberOfObjects we continue to serve request until we receive the number the client wants), and the pattern you're seeing is due to the underlying architecture of CosmosDB, consisting of phyiscal + logical partitions. It sounds like you're implementing Paging with the interplay of your client, and that is fine. However, I don't think this is what CosmosDB refer to with their SLA times.
What other undocumented unexpected behaviors are there that are going to get us by surprise in production environment? #ThinkingLoudAgain
This is a bit vague, but my advice would be for you to read up on all FeedOptions together with CosmosDB Performance tips and make sure you understand them as well as their application areas.
EDIT: Also, another warning. I am currently running into an issue with Continuation Token and DISTINCT keyword in the SQL query. It does _not_ work as expected without an ORDER BY.
Adding Stack Overflow link created for same issue.
I believe I have encountered this issue as well starting with v2.2.3 (verified it was not present in v2.2.2). Our code automatically retries the query, but it's significantly slower than it used to be since we're now doing hundreds of round trips back and forth from client to server. An example is a typical ~1.5s query is now ~9s on average (6x slower) when switching between SDK versions.
I've dug for a few hours and I can see the following before/after behavior (over HTTPS via Fiddler):
v2.2.2: 200 HTTPS calls (1 per partition key range)
v2.2.3: 20,301 HTTPS calls (201 + 200 + 199 + 198 + 197 + 196 ...)
Watching the system with v2.2.2 a call to await query.ExecuteNextAsync<TOut>(); would block until all partition key ranges had responded.
Watching the system with v2.2.3 a call to await query.ExecuteNextAsync<TOut>(); appears to return as soon as any partition key range responds. Consequently in application logic since HasMoreResults will be true we circle back around and re-issue the query with the given ResponseContinuation which seems to re-issue the query for all partition key ranges minus the one that completed. This continues in a loop causing the significant increase in requests mentioned above.
I can see this behavior in v2.4 of the SDK as well.
Some help would be greatly appreciated! The changelog for v2.2.3 indicated only
Diagnostics improvements
so I'm not sure what may have changed with that version. Thanks for your help!
We encountered this problem - last page would return a non-null continuationToken and subsequent query would return empty result. It seems to be resolved by using the new V3 SDK.
<PackageReference Include="Microsoft.Azure.DocumentDB.Core" Version="2.8.1" />
<PackageReference Include="Microsoft.Azure.Cosmos" Version="3.2.0-preview2" />
@gorillapower related issue
Any update on this issue? I just updated to the latest version and am still getting the error
It kind of explained in the query troubleshooting guide
@j82w Thank you, I think I am missing something though. Which section should I be taking note of to avoid the malformed token?
- Sometimes queries may have empty pages even when there are results on a future page. Reasons for this could be:
- The SDK could be doing multiple network calls.
- The query might be taking a long time to retrieve the documents.
- All queries have a continuation token that will allow the query to continue. Be sure to drain the query completely. Look at the SDK samples, and use a
whileloop onFeedIterator.HasMoreResultsto drain the entire query.
There is also more of an explanation here: https://github.com/Azure/azure-cosmos-dotnet-v2/issues/376#issuecomment-342622696
@j82w that makes sense. I think maybe my issue is more related to This issue in the v3 sdk. Ill go there.
@j82w I think I have encountered this same issue in SDK v3. We're receiving a continuation token that returns zero results. Is this ticket meant to track the issue in both SDK v2 and v3, or only v2?