Is your feature request related to a problem? Please describe.
I think it's way past due that Amplify supports total counts in GraphQL queries. Other GraphQL based platforms has this built in by simply adding totalCount
(or similar) in the query and no matter the limit, they'll get back the total count in addition to the (filtered) data.
Describe the solution you'd like
This should at least work for DynamoDB backed models and of course also for search based queries that pass by ElasticSearch.
Describe alternatives you've considered
Making a Lambda function that is a field in each model using the @function
directive, but since we are both using listItems
and searchItems
with filters added, the implementation is not simple as we have to reapply those filters on the lambda function for getting the correct count.
Making custom resolvers seems like another not very fun route and not very scaleable or maintainable, and once again, this should be a "out of the box one liner" to have available as a developer. With either a Lambda or some other custom resolver I'm looking at hours or days of development.
Additional context
This is a must have feature and there's not really any workaround for displaying total counts for systems with many items — at least that I know of. I read several bug reports, but none of them seems to have a simple solution. That it has not yet been developer by AWS is beyond my understanding, as pulling out counts is one of the most common things to do when developing web apps.
There is a workaround for this and a reason why it is not automatically included as part of amplify - DynamoDB does not have an efficient way to count items, and can't in general perform aggregate operations. If you want to count items, write a custom lambda to read from the dynamo table stream and conditionally increment or decrement a counter on the table. Add a lambda resolver to read this counter value from the table, and add it to your schema.
Because dynamo can't perform aggregate operations, and you potentially want to get item counts after a filter expression is applied, having the cli provide this feature is a tall order.
Yeah, I aware of DynamoDB limitations, but asking the developer to develop correct counts is also a tall order. As I can have any combination of filters, I would need to count based on that and that seems like a huge implementation.
A count of items is something most apps use, so while I may be able to develop something that works decently, this feature request really is about either DynamoDB fixing this or Amplify creating some kind of sensible workaround that does not put all the workload on the developer.
Thanks for your feedback, the things you outline has been in my thoughts in relation to a workaround on this, but trying to come up with a sensible workaround that respects any type of filter on any table seems like something I could spend a week or two developing and may still be prone to wrong counts in edge situations.
Maybe ElasticSearch has counts of total results, even if there's a limit applied and this could be included in results as a decent workaround? We use ElasticSearch for most queries, as the default filters are too limited anyways.
Elastic does provide this information, you can access it in the resolver template with: $ctx.result.hits.total.
There is a difficult disconnect between graphql and dynamodb development mindset. Dynamo DB requires planning of most access patterns in the design phase, while graphql makes an iterative approach seem straightforward. While amplify backed by dynamodb does offer some flexibility, it does require more ahead of time planning than other platforms, in this case with aggregates. If you know your aggregates now, and know they are stable, development of lambdas is doable. Solving the general case sounds much more difficult as long as your data is backed by dynamodb.
I agree that it would be helpful to have templates to speed up creation of lambda aggregate creators (including backfilling data).
Elastic does provide this information, you can access it in the resolver template with: $ctx.result.hits.total.
That sounds promising. Any hints on how I could create a @function
directive to map to a resolver returning the total hits, without having to overwrite my default resolvers? I have not yet done any custom resolvers or overwritten in this project and I'm trying to avoid that. I have done a few @function
directives that run a Lambda for other things (and that works well).
Elasticsearch by default returns the total hit count, and by default its accurate up to around 10,000. So if you make a rest api call in your lambda you will get this back as a response. Just update you schema to include it as a property. You will likely need a new type to hold the items, nextToken, and total property.
I do think that $ctx.result.hits.total
should be exposed by default since it is readily available.
In terms of DynamoDB support, I'm not too sure. I mean it would be nice to have, but if you have a large dataset just getting the count on every request will be expensive. Because every time you go over 1 MB of data, you have to paginate through to get totals. In reality, you could scan through a table made up of millions of records before you can get a final result, which obviously makes no sense even in terms of performance.
If we were to build this into Amplify in the future with DynamoDB, could you give a bit more detail on some things:
SELECT COUNT(*) FROM ...
Any other specifics on your requirements or use cases would help us look into this in the future.
@undefobj
I would say it would have to be the total count of items matching a query. This would be primarily needed for pagination purposes, it's very hard to give a good user experience if we are supposedly showing X results at a time out of Y number of records. This would also need to work with custom indexes, as most likely that's the most efficient setup when listing tables unless we are using searchable.
I wouldn't mind if there are additional lambda functions added to the GraphQL Query, however, I think it would be wise if we are able to choose whether to request/run this or not. If we are paginating across 1000 results with 10 results a time, I don't think we need to necessarily run the total length query 10 times, it could be that the first request is enough. Also, there could be instances where I wouldn't care much about the count and as a result wouldn't want to run the lambda function for no reason. Potentially another way to go around this could be to have a 'cache'/dynamo table store for counts; with rules of how long count results may be valid for.
I hope that makes sense.
I agree with @jonmifsud — the main interest for us is to have a count of the filtered query, so we can do a proper pagination and show the total results of that query.
A full count would be nice, but if you do a normal query, shouldn't that result in a count of all items? Also, maybe a way to just return the count would be nice instead of having the count as a part of the result dataset would be nice.
I also would like to be able to choose when to use this and when to not, to avoid extra processes running with load time and costs overhead.
You should be able to do pagination using nextToken
without needing to know the count. However, what you propose above to have a count on the query results would be difficult. While not impossible, this feature would be quite difficult to do. I'll keep the issue open so that we can get other community feedback and measure interest, but for transparency it would need a lot of design and thought and wouldn't be delivered anytime soon.
Associating the word "impossible" for a query count in 2019 makes me cringe a bit. And more than that, it makes me wonder if selecting Amplify and all its (current) dependents was a very wrong choice.
The fact that DynamoDB does not do counts for its queries (besides a full table estimated count every ~6 hours) is simply a limitation the team working on DynamoDB should solve. Every single competitor to DynamoDB handles this without issues, so I'm sure those smart people can also come up with a solution that does not just benefit AppSync and Amplify users, but also people using DynamoDB directly. Maybe it will be near correct counts if millions and more precise when thousands like MySQL / InnoDB, and that would be way way better than having no clue whatsoever.
I am aware that using the nextToken I can make pagination but that paginator is somewhat less cool to look at from a UX perspective as I won't be able to show 1, 2, 3, 4....12 because I don't know how many pages I have. When someone wants to know how many items to we have fitting this filter, it cannot be that I have to pull them all out (only the id field) in the leanest way, and then count the array client-side?
I'm sure AWS compares themselves in some ways with other GraphQL services like Prisma etc. and they don't seem to have a problem supporting this.
This is a DynamoDB limitation. Attacking a solution on top of that for AppSync is the wrong angle, this needs to end on the table of AWS DynamoDB developers so they can come up with a sensible solution nearer to the root of the problem — everything else is a hack. Asking me to keep counts in a model/table myself when things update is even worse and not what you'd expect of a platform with an otherwise impressive feature set.
And if it's not possible for DynamoDB to solve this, then the Amplify / AppSync team should start considering built-in support for other major database players such as MongoDB, MySQL, Postgres, etc. so they are not being held down by a half-baked database that is backing the entire thing, but when that is considered, I am sure it looks way more interesting to just figure out a solution to counts and other minor limitations DynamoDB currently has.
@undefobj What I would like to see to support this type of feature is to utilise kinesis firehose and glue to send data in Redshift or S3 Parquet. Then I could connect to Athena as a serverless analytics system, query Redshift for a non-serverless solution, or have another lambda pick up the objects from S3 and send them into Postgres. Amplify is well placed here as the api category can pick up on schema changes to re-run glue, amplify can make life easier creating a new firehose for each table, and setup the lambdas to put dynamo stream data into firehose. I realise its a big ask, but the question can be generalised into "how can amplify better support analysis workloads"
I see a smaller and faster win on just providing an aggregation template lambda to deal with uncomplicated counters. I would not like aggregation counters enabled by default, as I don't want to pay for what I don't use.
@undefobj What I would like to see to support this type of feature is to utilise kinesis firehose and glue to send data in Redshift or S3 Parquet. Then I could connect to Athena as a serverless analytics system, query Redshift for a non-serverless solution, or have another lambda pick up the objects from S3 and send them into Postgres. Amplify is well placed here as the api category can pick up on schema changes to re-run glue, amplify can make life easier creating a new firehose for each table, and setup the lambdas to put dynamo stream data into firehose. I realise its a big ask, but the question can be generalised into "how can amplify better support analysis workloads"
We do support Kinesis Streams in the Analytics category today: https://aws-amplify.github.io/docs/js/analytics#using-amazon-kinesis
I have to check on the PR for supporting Firehose but I know it's in progress.
This seems to be an independent ask of the issue in this thread though as this is related to ingest. Total counts in a decoupled systems wouldn't be accurate on the pagination against DynamoDB as you'd run into consistency issues. For the analytics requests I think this would be a good ask in a separate Feature Request issue.
I wasn't as interested in strong consistency or using it for pagination, but the total count calculation and getting aggregations in general. The ask seems to be about bringing other database system capabilities into amplify, which is where ddb stream -> firehose would come in handy as a building block.
This got sidetracked by different ideas, but I'd like to know if we can expect total counts for at least Elasticsearch based queries soon?
@undefobj $ctx.result.hits.total
is in the resolver response template already, so there should be very little code needed to get this wired in all the way. This would help with a decent workaround for overall total counts for now. It seems doing this against normal list queries against DynamoDB is not anything to expect soon.
It seems like #2600 made #2602 happen, so here's to hoping that PR gets accepted fast.
@houmark isn't that PR related to @searchable
attribute? I want to know how many items are in a collection in total.. don't know if that PR solves this issue.
For example, a user has many books in their library, and I want to show on their profile page that how many books are there in their library. I don't want to load all of their books, just to count it. It might be thousands.
Yeah, I first thought that PR would solve it due to the original code that was halfway baked in at the time. But that PR changed to just show the total of the returned items which is more or less useless because it's very easy to count the amount of results returned client side.
I don't think the Amplify team is working on a total results value due to the limitations and complications of passing through @auth
.
@houmark I wouldn't call it "useless" since if you're getting paginated results say 10 results at a time, you might want to get the total hits (if you have more than 10 results) to display information like the number of pages on your UI.
@kaustavghosh06 But how can I do that? I get 10 when I limit
10 in the query. If I limit
20, total
is 20? If I am on the "last page" and limit
20 but there are only 5 results left I get 5, but all of that I can do with result.whatever.items.length
client-side.
If I try to pull out 1000 then I hit another limitation which is the result set being too large (don't remember the exact error, but it errors, and if I query a lot but not too much to hit the limit, then it takes, of course, many seconds for result which in a UI leads to racing situations, and I cannot cancel queries due to another limitation, which has a PR but that PR has been stuck now for a good long time). Anyways, If I have 3k items, then it will still give me 1000 in total
, so that does not really help me for pagination does it?
Am I missing something here?
@houmark Do you have @auth on @searchable?
I'll speak to @SwaySway about this behavior.
I would expect the total hits to be available to the client for pagination purposes.
That was what I expected also, but this changed in the PR because it would leak data about the total results if @auth
is filtering out results due to "access control".
See his comment here: https://github.com/aws-amplify/amplify-cli/pull/2602#issuecomment-543844821
For us, first of all we would not have a leak because we don't have advanced access control in our models, so we would get the right amount of results, but even if we did, I would have liked to at least have the flexibility to allow "leaking" and getting the result as a developer. This could come with a warning by Amplify, but as there's no alternative to get totals in any type of way in Amplify (besides maybe writing and maintaining resolvers ourselves which seems somewhat a bad path) it would be nice to have some indicator to be able to show in the UI (for example "around 2.604 results").
Of course, the optimal solution would be that @auth
hooked on to the result and added in the total
after filtering, but I understand that's a major undertaking and would potentially slow down all queries.
I'm flabbergasted. How can a database system have a support for pagination, but give no information regarding the total results? How do you make the pagination UI then? Sure you can show "Next" until there isn't a nextToken anymore or do infinite scrolling, but those don't work for everyone. I almost cant believe it that DynamoDB can't give you a simple count.
Too bad i wasn't aware of that limitation when the project had started. Looks like I have to switch over to RDBS like Aurora and do all the hard work of custom resolvers.
@psporysz dynamodb doesn’t prevent this UI pattern, it only requires that you do additional work. You can setup a streams trigger to keep a record count yourself. You can then fetch the count alongside the fetch for data to display a total count.
Or, you can sort the items with a monotonically increasing key and have a stream trigger insert a whole number count on the item, this would then allow a skip and limit pattern on a secondary index.
Unless there are other significant reasons, DynamoDB can still handle many patterns with a bit more work.
Or, you can sort the items with a monotonically increasing key and have a stream trigger insert a whole number count on the item, this would then allow a skip and limit pattern on a secondary index.
Let's say i have a secondary index for date field and I would like to know the number of items in between some date range. Can you explain how to apply above technique in this case?
@psporysz dynamodb doesn’t prevent this UI pattern, it only requires that you do additional work. You can setup a streams trigger to keep a record count yourself. You can then fetch the count alongside the fetch for data to display a total count.
Or, you can sort the items with a monotonically increasing key and have a stream trigger insert a whole number count on the item, this would then allow a skip and limit pattern on a secondary index.
Unless there are other significant reasons, DynamoDB can still handle many patterns with a bit more work.
What happens with 2 simultaneous posts?
Is this safe under DynamoDB?
Can anyone just give an example of how to get the the total hits from elastic search?
You can't get totals from Elasticsearch, Amplify/AppSync strips that information as the @auth
directive can be filtering the final result of items if the user does not have access.
Effectively, at this time, there's no out of the box way to get even close to a precise count of a query result.
Is there any update on this issue? I'm just trying to get the total count on ElasticSearch
using @searchable
, but the service is not returning the total count. It just returns the paginated total count. If I send a no limit, then the total count becomes 10, since it seems to be the default limit... I'm just trying to count the number of items hit by some filter...
In _AppSync Console_, I've tried to add a total in the schema
type ModelUserConnection {
items: [User]
nextToken: String
total: Int
}
And attach a Resolver to the ModelUserConnection -> total
In "Configure the request mapping template." section
{
"version" : "2017-02-28",
"operation" : "Scan",
"filter" : {
"expression" : "#return( $util.toJson($ctx.source.scannedCount) )"
}
}
In "Configure the response mapping template." section
$util.toJson($ctx.result)
then I can query the total
query userList{
listUsers{
total
}
}
result would be
{
"data": {
"listUsers": {
"total": 6
}
}
}
I'm new to Amplify and AppSync, so not very sure whether this is a correct method to do this.
Reference:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Scan.html#Scan.Count
https://docs.aws.amazon.com/appsync/latest/devguide/resolver-mapping-template-reference-dynamodb.html#aws-appsync-resolver-mapping-template-reference-dynamodb-filter
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/LegacyConditionalParameters.ScanFilter.html
I've tried to add a total in the schema
type User @model { id: ID! name: String! total: Int }
In _AppSync Console_, attach a Resolver to the ModelUserConnection -> total
In "Configure the request mapping template." section
{ "version" : "2017-02-28", "operation" : "Scan", "filter" : { "expression" : "#return( $util.toJson($ctx.source.scannedCount) )" } }
In "Configure the response mapping template." section
$util.toJson($ctx.result)
then I can query the total
query userList{ listUsers{ total } }
result would be
{ "data": { "listUsers": { "total": 6 } } }
I'm new to Amplify and AppSync, so not very sure whether this is a correct method to do this.
Reference:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Scan.html#Scan.Count
https://docs.aws.amazon.com/appsync/latest/devguide/resolver-mapping-template-reference-dynamodb.html#aws-appsync-resolver-mapping-template-reference-dynamodb-filter
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/LegacyConditionalParameters.ScanFilter.html
It is a solution yes. But a scan operation is expensive ans slow. If you have 6 items, its okay. But if you have 100k its not good to use this option.
@davidbiller
Thanks for the reply. In your opinion, what is the good option? Thanks!
@danielcai1987 scan is limited to reading 1MB of stored data at a time, if you ever expect this table to hold more data, the above solution will not work
I can't believe this is still unresolved
This is a technical limitation. Don't think anything's gonna happen here anytime soon.
Anything on this?
Yeah.
Any kind of app will require this unless the solution is for playground only
It's possible to have scannedCount and count as result from the GraphQL query
read here
Thanks. I wonder why is there no official documentation about this.
Thanks. I wonder why is there no official documentation about this.
Because the 1 MB limit. So it’s not really a Option.
I hope that this technical limit from DynamoDB gets solved, but modifying the Graphql scheme is not an option for me, in the meanwhile I've created a new DynamoDB table with the owner and count fields, this table is updated vÃa lambda trigger after each insert/delete in the other DynamoDB table. After that when I need to know the total count by owner, I do in parallel the query to the counter table and to the data from the other DynamoDB table (limit of 10 results), with this I have isolated the counts by owner without to do a full scan and I have the total pages and total elements available for a good UX experience.
P.D: You could use the @function in the Graphql scheme to do that.
Maybe this solution could helps to you.
Just finished reading through this steam of comments. Many of you guys have done a ton of research, thank you! Really saved me a lot of time!
I'm gonna join the choir here, and ask that this issues is being recognised soon. It's really a pain to not have this functionality.
I've chosen to follow the path with a separate table where i save the aggregated (incremented) results base of the triggers. Not perfect, and dosen't solve the totalCount with filters like others have mentioned.
I've tried the @ricsirigu suggestion here but when limit = 10 i have:
{
items: [10 items],
nextToken: "asdasdasd",
scannedCount: 10,
count: null
}
The length of my result set (without filtering) is about 200 rows and the full table cardinality is >200k rows.
¿Any suggestion?
My query is based on a Secondary Index named "byMerchant" and the query is "transactionsbyMerchant"
My schema.graphl
type ModelTransactionConnection {
items: [Transaction]
nextToken: String
scannedCount
count: Int
}
type Transaction @model
@key(name: "byMerchant", fields: ["merchant"], queryField: "transactionsByMerchant")
{
id: ID!
merchant: ID
...
}
@Hernanm0g limit is applied before filter in DynamoDB, this is the opposite from other databases. Using the blog post to pagination will not work. https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html
Some more information. DynamoDB will only ever read 1MB of data from disk from a query or scan. You cannot find the total count by reading the items from disk. If this were not the case, you could potentially ask DynamoDB to read 1PB of data and degrade performance for other users.
Counts are a type of database aggregation that you must manage manually with DynamoDB. Consider using a table stream lambda to manage the counts manually.
I hope that this technical limit from DynamoDB gets solved, but modifying the Graphql scheme is not an option for me, in the meanwhile I've created a new DynamoDB table with the owner and count fields, this table is updated vÃa lambda trigger after each insert/delete in the other DynamoDB table. After that when I need to know the total count by owner, I do in parallel the query to the counter table and to the data from the other DynamoDB table (limit of 10 results), with this I have isolated the counts by owner without to do a full scan and I have the total pages and total elements available for a good UX experience.
P.D: You could use the @function in the Graphql scheme to do that.
Maybe this solution could helps to you.
Glad to see a lot of research has already been done on this issue, really hope to have it resolved soon. In the meantime, this solution you provided really seem ideal for my current use case, however I am quite new to aws. If it's not much trouble could you further elaborate on the hint you provided here as to how to make a trigger out of the @function directive which I presume will be updating the count table by listening to the insert&delete event in the source table.
Is this already considered by any AWS team? Some words about it will give us hope to this limitation. Thanks!
I've been thinking about this one as its an area in our product development we are now getting to. I'd like to warm this up with a few options that hopefully might lead to something. Like a lot of others I'd just assumed that this would be trivial, a fault of my not understanding document databases well enough I guess. Its a bit of a trap to fall into when so far down the line of developing, try telling your client/manager that you need 5 days to get out the total records for something...its not a good look.
One issue with this whole thread seems to be the assumption that whatever the solution it has to cope with zillions of records when in fact a lot of apps might only be counting in the 100's or 1000's in some tables so I don't think there needs to be a one size fits all solution. I get that some of those solutions could cause some big issues if mis-used but with some knowledge of any constraints or suggested working parameters that should be safely navigated.
I'll start off with the smallest (and some might say naive) solution first and build up to the bigger ones. Remember, there isn't a one size fits all solution. My focus is on trying to make it easy to add this functionality with minimal hassle from the "developer building out the schema" point of view.
If you don't have many records (< 1000 or so) you might be able to get away with using the scannedCount
field as described here https://www.riccardosirigu.com/blog/aws-amplify-dynamodb-total-count-graphql-queries/ although it can "top out" when it hits its 1MB result set limit and so report a false total. Ideally it would be nice to know if there are more that could have been returned or if that is the total amount i.e. was there another page of results possible? That would at least give you the option of saying "record 1 of at least 246+" or something if we can't get an exact amount back.
Another option here is to do a full paged scan if I opt into it so I can get a "true total". This would loop through all the pages of results for a scan and return the total. I keep hearing that scans are expensive but again if I've only got a relatively small number of records that might be a hit I'm happy taking for now and just swallow the hit and push off doing it "properly for epic scale" in the future when it becomes a happy problem (ie we've grown so much we need to change it).
I can see this being a new directive that could be added to a model, something like:
type Posts @model {
title: String!
content: String!
author: String!
published: Boolean!
totalPosts: Int! @scannedCount
}
This new @scannedCount
directive would automatically do a scans of all the records in the Post table that could page through all the results until it finishes (thus masking from us the issues with the 1MB returned data limit). Used as shown above it would just give the total of all records for that table.
Now if we ignore the obvious performance traps that this directive sets up for us (we will discus those next) I think we can all agree that from a users point of view that looks like a simple and clean way of getting what we want, its a super quick way to get a total value out, problem solved from the users point of view initially. For fast prototyping or "good enough" features that might do me.
So now on to the problems this will cause or that need solving. Luckily now we have a directive we can add to it with some attributes to give us some finer grained control or indeed we could create multiple directives to give us the following functionality.
A quick note on Lambda functions, we are going to need some generating, would it be worth while if these are generated and stored in the /backoffice/functions/ folder as user generated functions are. That way should a user want to extend the function with additional logic they can. The functions this directive might generate should be all nicely stubbed out with pointers on where and how to add additional logic and would need pushing to deploy like any other user generated function. This directive shouldn't hide all the magic in the cloud and VTL templates, it should do the minimum to make the common case go a way and then give the user an extension point to add additional logic as their needs demand. This might mean that removing the directive from the schema won't delete the resources from the cloud automagically if you've edited them, I'd be happy with a warning about "Some functions appear to have been edited so you will need to manually delete the following or (A) delete all anyway".
If left unchecked this could get out of hand with the number of scanned pages growing over time so as a fail safe we should have a default number of pages that this will loop over before returning. The default can be agreed later but if the user wants to opt-in to setting it then they could in the directive like so:
totalPosts: Int! @scannedCount( maxPageDepth: 5 )
The maxPageDepth would be an optional param to set how many pages you are happy to do to get that total to stop it all grinding to a halt in the future (but hey we opted into this solution, its not ideal its pragmatic and we know the trade offs).
The trouble with the above is the same as the scannedCount solution from the blog post, it doesn't tell you if it reached the end of the pages or not, so you've now way of knowing if the reported number of results is the total amount or just the amount it got before hitting maxPageDepth. But it might be good enough for the right sized app and it sure is simple to do from a users point of view. As long as the limitations are documented they users should be able to make an informed choice.
Additionally maybe a trigger in CloudWatch could be created as part of using this directive to ping a notification when a field is taking too long to rescan? A nice bit of Developer Experience and another way to introduce a new feature to those of us still learning our way around what AWS has to offer.
I want something other that the total count, I want a count of some filter or other. This ones a bit trickier but could we allow filters in the directive?
totalPosts: Int! @scannedCount
totalPublishedPosts: Int! @scannedCount( filter: "published=true") #something like that anyway
That would work but its not dynamic, I can't pass in a filter at runtime but it would be good for various counters/stats that could be handy to have. I wonder if we could generate a GraphQL query that would allow you to pass a filter through to this directive/resolver (the lines become blurred as to which this now is at this point, guessing its a resolver under the hood but then I don't know enough to know what a resolver does under the hood as yet so my terms might be wrong here, feel free to enlighten me).
Sort keys could be passed through too maybe to give you a total for a subset in a similar manner, its not my strong point so this part of the ideas is hazy on purpose but here to get you thinking.
Another issue that @houmark raises repeatedly is that the number of results that match a filter can only be worked out if we scan all the records and also it doesn't work if we have a limit. This @scannedCount
would do both I think in its own brute force way.
Do we really want or need to look this total up every time? If its a small number of records then why not, it wouldn't be all that slow nor costly (again we are talking about relatively small number of records here not eBay products or Uber rides size apps) and we could cache it on the frontend if needed.
But if you want to opt in to caching we could do it one of two ways that could be set via the directive at the expense of another table and a function.
You could do this with a simple boolean or enum (to allow for more options in the future):
totalPosts: Int! @scannedCount( cached: true )
I'm not 100% sure of the internal here so I'm going to propose what I think it could do with my understanding of the options, there might be other better ways but this should give the gist. We need to store the value somewhere and then pull it back out.
One way would be to do this under the hood:
@scannedCount
directive has already created this table we can just re-use it? Along with storing the total would be also store some other fields like model name, target field, createdAt and updatedAt.This version uses a Lambda trigger function using DynamoDB Streams (one of the options when creating a function via the CLI so a known thing) that hooks into this models table updates and will rescan the table whenever a record is added/removed. This is a bit brute force but it would work.
The trouble with his method is it could cause needless thrashing of the tables when doing an import, we don't need that number generated on every entry if we are adding 100's or 1000's of records. I might be that you can live with the hit if the numbers are small but an option to disable the caching while doing an import but still trigger it to do a final count at the end would be good. I sadly don't know enough of how to do that nicely. You could add a "disabled" boolean attribute to the directive maybe but that would need another push to get it up to disable it and then how to get it to recalculate after enabling it again without just "touching" an existing record afterwards? That might be enough?
totalPosts: Int! @scannedCount( invalidOnCRUD: true ) # open to suggestions on any of these names btw
Another option is similar to the above except we don't do the rescan when the record changes we just null the value in the cache table. Then our resolver that gets the total for that field will do the rescan for us if the value is null. Depending on the data model and workflow of the app this might be a better fit.
My only worry here is that we separate out the two bits of cache work to two separate areas, the function to clear the cache and the resolver to build the cache.
An alternative if we are ok with the totals potentially not being 100% accurate at any given time is to have a timed Function do the rescan repeatedly after a set timespan. Again this could be set via the directive:
totalPosts: Int! @scannedCount( refreshCountInSeconds: 300 ) # regenerate every 5 minutes
I don't know the best practise for passing timespans around so that is up for debate but hopefully you get the idea.
How we get that cached value out into that model though I'm hazy on, I guess we just have a resolver generated which would go get the value from the "cache table" but better minds than mine will know how best to do that.
Rather than having these fields on the model themselves do we need a PostTotals
model generated instead that can store all these top line numbers on and possibly pull these from the cache table via a resolver or when generated by amplify codegen
etc? In that case does the directive move up to the whole table rather than the field level?
type Posts
@model
@scannedCount( fieldName: "totalPosts" ) {
title: String!
content: String!
author: String!
published: Boolean!
}
What if we out grow the @scannedCount directive above, what options do we have? Scanning is too slow due to the number of records we've got. This might be where a @runningTotal
directive might be the answer.
It has been suggested above that you can keep a rolling count of various totals by hooking into the "DynamoDB stream" which will fire an event every time you add/update/delete a record in DynamoDB. We could then use these to increment or decrease a totals counter.
This solution though has some instant problems:
I'd like to see a directive that could wire all this up for me from the schema, this would spin up the functions needed (stored locally for editing) and store the values in a separate table similar to the cache table mentioned earlier if not the same table but different records so we don't clash? Lambda trigger functions hooked into DynamoDB streams would do the adding/subtracting. As these are built to run all the time we won't have the same potential scaling/performance problems we had with the @scannedCount
directive.
type Posts
@model
@runningTotal( fieldName: "totalPosts" ) {
title: String!
content: String!
author: String!
published: Boolean!
}
Now what about that idea to have multiple counts on a table? Could we do this:
type Posts
@model
@runningTotal( fieldName: "totalPosts" )
@runningTotal( fieldName: "totalPublishedPosts", filter: "published=true" ) {
title: String!
content: String!
author: String!
published: Boolean!
}
I don't see why not. Need more power? They you should be able to go edit the function code or copy it and now you've got a great starting point to adding in a few more lines to get you more counters as needed. Easy of development continues and I think we are still using all the tech that Amplify already has and uses so not a massive stretch of knowledge or comfort area?
Getting into wish list land with this last one, suggest the above gets done before this one gets looked at but...
Final option is something similar to what @RossWilliams suggested above that of "going large" and streaming the table data from dynamo into a Relational database so you can do full on SQL magic on that. This one currently is beyond my pay grade, I laughed when I first read Ross' comment and it sounded crazy complicated but after a bit of reading I can see how it all might stitch together...but again...should someone have to do all that to get some stats or counters out of their data? The answer might simply be "you are using the wrong database in the first place!" and I should go with a relational database from the off...but that isn't how Amplify is marketed or sold to me!
The basic idea is that by hooking into a DynamoDB Stream we can push record changes to a relational database and then from there we can pull out whatever data SQL will allow us to do. Sounds excessive but if its what we've got then lets use it. Could another directive handle this one?
type Posts
@model
@streamToSQL { # Guess this would be RedShift?
title: String!
content: String!
author: String!
published: Boolean!
}
This would push any CRUD operations into a relational database, I guess you would need to add this database first as a resource via the CLI amplify add sql
. It would create a table name the same as the model (could allow you to over-write this in the directive).
What to do about relations and @connections
though. In theory it could include those and I guess that would be the dream.
This does raise the idea that do we need a @sql
resolver similar to the @http
resolver so we could quickly get the results of a sql query into our GraphQL? I'll leave that one for another issue I think ;)
@houmark raises the tricky point about authentication earlier and to be honest I don't know. I'm not all that sure if this is an issue with passing @auth
to elastic search or just within the api. I'm hoping its an elastic search thing which would mean that the above solutions should work with the possible exception of the relational database idea, I can see that needing a lot more thought.
It should be possible to apply @auth
to the model or the individual fields like normal which should cover it for the @scannedCount
and @runningCount
directives. Alternatively if we use a separate table like PostTotals
a discussed above then @auth
could be applied maybe to that? That should re-use the logic already in place I think?
Sorry this is so long, it wasn't meant to be, its a tough problem and I've been giving it a lot of thought as I hope you can tell. I'm hampered though by only knowing some of what is needed to know the best route but hopefully this adds to the conversation and brings a users point of view to how I'd personally like it to roughly work and what constraints I might be happy living with to get some of the functionality I and others say we need.
I've tried really hard to leverage what Amplify already has or uses so that it shouldn't be a massive stretch of the imagination or skillset to get it to do what it needs to.
Lets keep the conversation going, what have the Amplify devs come up with in the time this has been open, where might it go and how can we get there (and ideally an idea of when)?
tl;dr; - You should create a Lambda resolver for this field and share your code for others to start from. I don't see an automated solutions fitting in well to the project without significant effort and problems to overcome
@PeteDuncanson you are an excellent writer and I thoroughly enjoyed reading this. My thoughts and responses are a lot more scattered, but let me try to explain my crazy ideas and somewhat address your proposals. I agree the proposals I have made in previous comments are not ideal, but still believe they are the best we have given the path Amplify has taken.
At an even higher level, imagine a developer (e.g. me, or a start-up – this is the target audience of AWS Amplify, right?) who creates an app which is basically a social network. There people can follow each other, create posts and like and write comments for those posts. This is pretty basic scenario/logic. And for this kind of apps displaying follower, like and comment count is quite crucial. Of course, at the launch the numbers are not so big and fall under 1Mb, but then with growth – the more users, the bigger those numbers are. The developer would like to choose AWS Amplify because of the infrastructure, ease, integrity and velocity mainly which are so needed at the early stages of new projects (Free Tier is also much appreciated). But suddenly he has a dilemma, why doesn't such a robust tool support such a basic feature? Should he fight these limitations (and then maintaining them) or just search for a tool which just supports that or at least where it could be done without hacking?
@ilyagru this scenario is easily solved with a pipeline resolver or dynamodb stream. After adding a like, update the like count on the post. Where it gets tricky is paginated (even more difficult with filtererd) results
Does anyone on the AWS team even read this? So far I am missing an official statement on this topic. We have currently spent a lot of money on development and we are still faced with such a trivial problem.
I think the Amplify team only has two options. Either solve the problem or find an alternative to DynamoDB. It's not ok that nothing has happened on this subject for 13 months. Not even a responsible person has been assigned to this issue.
@Amplify Team, @aws-amplify-ops Please give us a statement.
@andre-scheffer (and others) this is good feedback and myself and @kaustavghosh06 have responded on this thread several times. We are both members of the Amplify team. The problem is non-trivial as others have noted in their descriptions above of potential implementations, all which we are looking into in the future which was noted earlier. As such we'll keep this thread open to continue to measure interest as well as potential thoughts on solutions such as those outlined by @PeteDuncanson, @duwerq , and @RossWilliams. Having further open dialogue here on requirements, DX options, and suggestions for implementation are welcome.
@undefobj, thank you for your answer.
The problem is non-trivial
I definitely believe you, as far as your part is concerned.
As such we'll keep this thread open to continue to measure interest as well as potential thoughts on solutions such as those outlined by @PeteDuncanson, @duwerq , and @RossWilliams.
and suggestions for implementation are welcome.
Did I get something wrong and Amplify is an open source project? We as paying customers of AWS should only supply the requirements. Sorry if that sounds arrogant, But this in an Amazon project and not any little community project.
So I have to ask again. What is your timeline on this issue? What is the Roadmap? Why is nobody assigned to this issue in GitHub?
Did I get something wrong and Amplify is an open source project? We as paying customers of AWS should only supply the requirements. Sorry if that sounds arrogant, But this in an Amazon project and not any little community project.
It is open source but it is also quite a small team. Each team in Amazon is different and run their own way. Lets be grateful we have a team on this who are engaging and open to ideas and not tar them with the "big evil rich company" brush.
Yes at the end of the day this tech exists to get us to sign up and spend money with them but then so are the apps we build with it. In the meantime they are trying to build the best stack they can but there are pressures to doing that that need some consideration no doubt.
For my part I'm slowly getting up to speed on the problem some more and seeing if I can't come up with a @runningCount
directive of my own, its a steep learning curve though. Hoping to get this up and running so I can dog food it a bit then offer it up here for discussion.
It is open source but it is also quite a small team
But someone at AWS needs to be responsible for this project? This someone needs to address issues like that. I don't understand why such fundamental problems have existed for 16 months. It's not good to see that there is such a long feedback loop without a final result like: 'Yes, we will fix that' or 'No, we won't implement it'.
open to ideas and not tar them with the "big evil rich company" brush
I fully agree, but it's still AWS with tons of resources. But yeah I stop complaining about it and try to find another solution like using an extra mirror database to do the math.
Hoping to get this up and running so I can dog food it a bit then offer it up here for discussion.
That would be awesome.
Found #2901 referenced in the docs (https://docs.amplify.aws/lib/graphqlapi/query-data/q/platform/js#frequently-asked-questions) so thought I'd link them together for anyone following the breadcrumbs
The thing I don’t understand here is if you don’t like dynamodb or it’s lack of counts, you have an option to use another database. Nader is about to put a video using a Postgres DB. I highly recommend it for anyone not interested in dynamodb https://twitter.com/dabit3/status/1328710735619268608?s=21
+1 @duwerq, there are definitely other things i'd rather have prioritized on their roadmap over this. There are plenty of other good db options.
@RossWilliams could you give an example of how to do a table stream lambda to implement the count. I am very new to amplify but really need this feature. Thanks
@snahammond Here are the steps. Send me an email, LAST_NAME.[email protected] if you get stuck.
At a high level, you are creating a lambda, connecting a lambda to the table change stream, adding permissions to allow the lambda to connect to the stream, and writing a lambda to handle create, update, and delete events. You are also going to create environment variables so you have the table names you need inside the lambda.
{
"category": "api",
"resourceName": "XXX_APP_NAME",
"attributes": ["GraphQLAPIIdOutput", "GraphQLAPIEndpointOutput"]
},
"DynamoTableXXX_TABLE_NAMEStream": {
"Type": "AWS::Lambda::EventSourceMapping",
"Properties": {
"BatchSize": 20,
"MaximumBatchingWindowInSeconds": 100,
"BisectBatchOnFunctionError": true,
"StartingPosition": "LATEST",
"Enabled": true,
"EventSourceArn": {
"Fn::ImportValue": {
"Fn::Join": [
":",
[
{
"Ref": "apiXXX_APP_NAMEGraphQLAPIIdOutput"
},
"GetAtt",
"XXX_TABLENAME",
"StreamArn"
]
]
}
},
"FunctionName": {
"Fn::GetAtt": [
"LambdaFunction",
"Arn"
]
}
},
"DependsOn": [
"LambdaExecutionRole"
]
},
```
3. Your lambda function policy needs the following statement for permissions. you will need to add permissions for any additional operations you use. Keep the existing permissions for log groups that Amplify generates:
```json
{
"Effect": "Allow",
"Action": [
"dynamodb:DescribeStream",
"dynamodb:GetRecords",
"dynamodb:GetShardIterator",
"dynamodb:ListStreams"
],
"Resource": {
"Fn::Sub": [
"arn:aws:dynamodb:${region}:${account}:table/XXX_TABLENAME-${apiId}",
{
"region": {
"Ref": "AWS::Region"
},
"account": {
"Ref": "AWS::AccountId"
},
"apiId": {
"Ref": "apiXXX_APP_NAMEGraphQLAPIIdOutput"
}
}
]
}
}
You can adjust the batch size and batch window I provided above.
"LambdaFunction": {
"Type": "AWS::Lambda::Function",
"OtherItems": "existing items here",
"Properties": {
"OtherItems": "existing items here",
"Environment": {
"Variables": {
"ENV": {
"Ref": "env"
},
"REGION": {
"Ref": "AWS::Region"
},
"XXX_DYNAMO_TABLE_FOR_UPDATES": {
"Fn::Join": [
"-",
[
"XXX_DYNAMO_TABLE_NAME",
{
"Ref": "apiXXX_APP_IDGraphQLAPIIdOutput"
},
{
"Ref": "env"
}
]
]
},
}
import { DynamoDBStreamHandler, DynamoDBStreamEvent, DynamoDBRecord } from "aws-lambda";
import DynamoDB, { AttributeMap, DocumentClient } from "aws-sdk/clients/dynamodb";
const documentClient = new DocumentClient();
const TABLE_NAME = process.env.XXX_DYNAMO_TABLE_FOR_UPDATES;
export const handler: DynamoDBStreamHandler = async (event: DynamoDBStreamEvent) => {
for (const record of event.Records) {
if (record.dynamodb.NewImage && !record.dynamodb.OldImage) {
const newItem = DynamoDB.Converter.unmarshall(record.dynamodb.NewImage);
if (newItem.XXX_INTERSTING_FIELD) {
await documentClient.updateItem({
TableName: TABLE_NAME,
Key: {
table_pk_attribute: XXX_A_PARTITION_KEY,
table_sk_attribute_if_exists: XXX_AN_OPTIONAL_SORT_KEY
},
UpdateExpression: "SET #xxx_my_counter_item = :xxx_my_counter_increase",
ExpressionAttributeNames: { "#xxx_my_counter_item": "xxx_my_counter_item" },
ExpressionAttributeValues: { ":xxx_my_counter_increase": newItem.XXX_INTERSTING_FIELD },
}).promise();
// TODO: handle failures
}
} else if (record.dynamodb.NewImage && record.dynamodb.OldImage) {
// old and new item, meaning record has been updated. check if you need to increase a count or not
} else {
// no new item, meaning record has been deleted. you may need to decrement a count
}
}
}
Most helpful comment
Associating the word "impossible" for a query count in 2019 makes me cringe a bit. And more than that, it makes me wonder if selecting Amplify and all its (current) dependents was a very wrong choice.
The fact that DynamoDB does not do counts for its queries (besides a full table estimated count every ~6 hours) is simply a limitation the team working on DynamoDB should solve. Every single competitor to DynamoDB handles this without issues, so I'm sure those smart people can also come up with a solution that does not just benefit AppSync and Amplify users, but also people using DynamoDB directly. Maybe it will be near correct counts if millions and more precise when thousands like MySQL / InnoDB, and that would be way way better than having no clue whatsoever.
I am aware that using the nextToken I can make pagination but that paginator is somewhat less cool to look at from a UX perspective as I won't be able to show 1, 2, 3, 4....12 because I don't know how many pages I have. When someone wants to know how many items to we have fitting this filter, it cannot be that I have to pull them all out (only the id field) in the leanest way, and then count the array client-side?
I'm sure AWS compares themselves in some ways with other GraphQL services like Prisma etc. and they don't seem to have a problem supporting this.
This is a DynamoDB limitation. Attacking a solution on top of that for AppSync is the wrong angle, this needs to end on the table of AWS DynamoDB developers so they can come up with a sensible solution nearer to the root of the problem — everything else is a hack. Asking me to keep counts in a model/table myself when things update is even worse and not what you'd expect of a platform with an otherwise impressive feature set.
And if it's not possible for DynamoDB to solve this, then the Amplify / AppSync team should start considering built-in support for other major database players such as MongoDB, MySQL, Postgres, etc. so they are not being held down by a half-baked database that is backing the entire thing, but when that is considered, I am sure it looks way more interesting to just figure out a solution to counts and other minor limitations DynamoDB currently has.