Aws-sdk-java: Doubts on DynamoDB queryPage combined with Limit and a queryFilter or a FilterExpression

Created on 8 Jan 2015 · 8 Comments · Source: aws/aws-sdk-java

I am querying a GSI in DynamoDB that has a HashKey and a RangeKey defined. I would like to limit the results to a small number (let's say 5), and I need to add a FilterExpression so that users querying this index do not get their own records.

The problem I am facing is that the limit appears to be applied before either the queryFilter or the filterExpression. So if one of the 5 records happens to be excluded by the queryFilter or the filterExpression, it gets discarded and I end up with fewer than 5 elements when calling the getResults method.

Is this the expected behaviour from the library? A DynamoDB restriction? Is there any way to have the limit applied after the filter (apart from the obvious options of checking the size and querying again, or setting a higher limit)?

Any help would be much appreciated

ricardclau

Most helpful comment

Thank you very much @david-at-aws for the confirmations and detailed explanations!
I think we can close the issue and hopefully this thread will be useful for anyone having similar doubts like the ones I had.

Thanks!

ricardclau on 9 Jan 2015

All 8 comments

Hi @ricardclau,

Yes, IIUC this is the expected behavior from the library. The service API currently only exposes a single "Limit" parameter that can be specified to restrict the size of the results. See the "Limit" section of:

http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html

In particular, the Limit is always applied before filtering on the server side, not after.
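To make the ordering concrete, here is a minimal sketch in plain Java that only simulates the behaviour described above; it does not call DynamoDB. The class name, the `queryPage` helper and the sample data are all hypothetical. The limit caps how many items are evaluated, and the filter then runs on that capped set, so the page can come back with fewer items than the limit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical simulation, not SDK code: shows "limit first, filter second".
class LimitBeforeFilterDemo {

    /** Simulates one Query call: evaluate at most `limit` items, then filter them. */
    static List<String> queryPage(List<String> itemsMatchingKey, int limit,
                                  Predicate<String> filter) {
        int evaluated = Math.min(limit, itemsMatchingKey.size());
        List<String> results = new ArrayList<>();
        for (String owner : itemsMatchingKey.subList(0, evaluated)) {
            if (filter.test(owner)) {
                results.add(owner);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Seven items match the key condition; two of the first five belong to "me".
        List<String> owners =
                Arrays.asList("alice", "me", "bob", "me", "carol", "dave", "erin");
        List<String> page = queryPage(owners, 5, o -> !o.equals("me"));
        System.out.println(page); // prints [alice, bob, carol]
    }
}
```

Only three items come back even though "dave" and "erin" would also pass the filter, because they were never evaluated.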

Hope this helps.

hansonchar on 8 Jan 2015

One quick way to handle this is to use query() instead of queryPage() and only read the first 5 items from the returned list. query() handles pagination for you, so it'll automatically make another call to DynamoDB if needed to fetch the additional records.

// Limit of 5 is passed through to the service on each query; we might get fewer than 5 results though.
List<MyObject> objects = mapper.query(query.withLimit(5));

for (int i = 0; i < 5; ++i) {
    // get() will make another call to DynamoDB if the first page of results didn't include enough
    // results.
    MyObject obj = objects.get(i);
    ...
}

david-at-aws on 8 Jan 2015

Thanks for your replies @hansonchar & @david-at-aws

Yes, this is what I suspected. I had some faint hope that queryPage might apply the filter slightly differently from query, but this is not the case :)

Regarding using query and then looping: please correct me if I am wrong, but query would usually fetch more elements than the 5 I need (and this is why queryPage was introduced in the API), so it would increase the read throughput consumption, right?

ricardclau on 9 Jan 2015

I have done some further research by activating full debug logs for org.apache.http.wire, and this is what I found (again, comments and corrections are very much appreciated).

The HTTP request triggered by mapper.query with a setup similar to mine does indeed retrieve 5 elements or fewer. So my previous assumption was wrong.
However, calling the size() method of PaginatedQueryList forces it to scan everything (this makes sense, but it is good to know because it clearly affects the read throughput). This is what misled me.
The for-loop trick seems to do the job: I can see a second HTTP request, again retrieving 5 elements or fewer, when the first request had scanned 5 items. The only problem is when the total number of results is smaller than 5, where I get an IndexOutOfBoundsException when trying to access an element that does not exist.

Would a try / catch for that situation be the best way to handle the edge case, or is there a cleaner way of doing it without having to do a full scan?

And related to all this, what would be the difference / advantages of queryPage over query? The PaginatedQueryList seems to be much smarter, although you need to be careful not to scan the full table. Is that the main reason for both of them to exist?

Anyway, I strongly recommend that everyone activate this debug log to see what is actually going on between the app and DynamoDB. It is very enlightening and helps make sense of everything :)

ricardclau on 9 Jan 2015

As you've discovered with the wire logs (which I agree are a great way to understand what's going on under the hood!), PaginatedQueryList is lazy - it'll (usually) make individual Query calls to DynamoDB one at a time as you ask it for more data. size() is the exception - it needs to keep calling Query until it has seen all of the query results to report an accurate size.
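The lazy behaviour can be illustrated with a small stand-in written in plain Java. This is a sketch under stated assumptions, not the SDK's PaginatedQueryList: `LazyPagedList`, its `requests` counter and the in-memory `source` list are all hypothetical, and a page fetch here simply copies the next slice of `source`. It shows that get() loads only as many pages as needed, while size() drains every remaining page.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class LazyPagedListDemo {

    /** Hypothetical stand-in for a lazily loaded result list; not the SDK class. */
    static class LazyPagedList extends AbstractList<String> {
        private final List<String> source;            // pretend remote results
        private final int pageSize;
        private final List<String> loaded = new ArrayList<>();
        int requests = 0;                              // simulated calls made so far

        LazyPagedList(List<String> source, int pageSize) {
            this.source = source;
            this.pageSize = pageSize;
        }

        /** Loads one more page; returns false when nothing is left to load. */
        private boolean fetchNextPage() {
            if (loaded.size() >= source.size()) {
                return false;
            }
            int end = Math.min(loaded.size() + pageSize, source.size());
            loaded.addAll(source.subList(loaded.size(), end));
            requests++;
            return true;
        }

        @Override
        public String get(int index) {
            // Fetch only as many pages as needed to reach `index`.
            while (index >= loaded.size() && fetchNextPage()) { }
            return loaded.get(index); // IndexOutOfBoundsException if past the end
        }

        @Override
        public int size() {
            // An accurate size requires loading every remaining page.
            while (fetchNextPage()) { }
            return loaded.size();
        }

        @Override
        public Iterator<String> iterator() {
            return new Iterator<String>() {
                private int next = 0;
                public boolean hasNext() { return next < loaded.size() || fetchNextPage(); }
                public String next() {
                    if (!hasNext()) throw new NoSuchElementException();
                    return loaded.get(next++);
                }
            };
        }
    }

    public static void main(String[] args) {
        LazyPagedList list = new LazyPagedList(Arrays.asList(
                "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l"), 5);
        list.get(0);
        System.out.println(list.requests); // prints 1
        list.size();
        System.out.println(list.requests); // prints 3
    }
}
```

In this simulation the first get() costs one simulated call, while size() pulls in the two remaining pages.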

If there may be fewer than 5 total results and you want to avoid the try/catch, you can use an Iterator, whose hasNext method lets you test whether you're at the end of the list without having to get the size of the list up front:

Iterator<MyObject> iterator = mapper.query(query.withLimit(5)).iterator();

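// Test the counter first: once 5 items have been read, hasNext() is not called again,
// so it has no chance to trigger another request to DynamoDB.
for (int i = 0; i < 5 && iterator.hasNext(); ++i) {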
    MyObject object = iterator.next();
    ...
}

Yep, the main advantage of scanPage/queryPage is that it's harder to shoot yourself in the foot by accidentally performing a query that consumes a LOT of capacity. With scanPage/queryPage, the 'Limit' parameter sets a hard limit on the maximum amount of capacity that the call will consume, and you have to explicitly write the loop to make multiple calls if you want to use more capacity to get more results. Query/scan are easier to use since they handle this pagination for you automatically, but you may end up accidentally consuming a lot more capacity than you meant to if you don't think through the implications.
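The explicit loop mentioned above could look roughly like the following sketch. It is a plain-Java simulation rather than SDK code: `Page`, `queryPage`, `firstMatches` and the integer "key" are hypothetical stand-ins. With DynamoDBMapper the corresponding pieces would presumably be queryPage, the last evaluated key of the returned page, and the exclusive start key on the query expression; please check the SDK documentation for the exact calls. The loop stops when enough filtered items have been collected, when there are no more pages, or when a self-imposed cap on the number of calls is reached.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class ExplicitPagingDemo {

    /** Hypothetical page: filtered results plus where to resume (null when done). */
    static class Page {
        final List<String> results;
        final Integer lastEvaluatedKey;

        Page(List<String> results, Integer lastEvaluatedKey) {
            this.results = results;
            this.lastEvaluatedKey = lastEvaluatedKey;
        }
    }

    /** Simulated single call: evaluates at most `limit` items, then filters them. */
    static Page queryPage(List<String> items, Integer startKey, int limit,
                          Predicate<String> filter) {
        int from = (startKey == null) ? 0 : startKey;
        int to = Math.min(from + limit, items.size());
        List<String> results = new ArrayList<>();
        for (String item : items.subList(from, to)) {
            if (filter.test(item)) {
                results.add(item);
            }
        }
        Integer last = (to < items.size()) ? Integer.valueOf(to) : null;
        return new Page(results, last);
    }

    /** Collects up to `wanted` filtered items using at most `maxCalls` page requests. */
    static List<String> firstMatches(List<String> items, int wanted, int limit,
                                     int maxCalls, Predicate<String> filter) {
        List<String> collected = new ArrayList<>();
        Integer startKey = null;
        int calls = 0;
        do {
            Page page = queryPage(items, startKey, limit, filter);
            calls++;
            for (String result : page.results) {
                if (collected.size() < wanted) {
                    collected.add(result);
                }
            }
            startKey = page.lastEvaluatedKey; // resume after the last evaluated item
        } while (collected.size() < wanted && startKey != null && calls < maxCalls);
        return collected;
    }

    public static void main(String[] args) {
        List<String> owners = Arrays.asList(
                "alice", "me", "bob", "me", "carol", "dave", "erin", "frank");
        System.out.println(firstMatches(owners, 5, 5, 3, o -> !o.equals("me")));
        // prints [alice, bob, carol, dave, erin]
    }
}
```

The `maxCalls` cap is a design choice in this sketch: it keeps the worst-case capacity consumption bounded, at the cost of sometimes returning fewer items than requested.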

david-at-aws on 9 Jan 2015

Thanks!

ricardclau on 9 Jan 2015

@david-at-aws Hi, I am using the new AWS SDK for PHP and I have the same problem. I want to apply a QueryFilter together with a query and a limit. You said that with the Java SDK we can use query() for this instead of queryPage().
In the PHP SDK we only have the query() method, and it isn't called again automatically if the result set is smaller than the limit. How can I do this in the PHP SDK?