Hi,
We'd like to be able to search a bucket with many thousands (likely growing to hundreds of thousands) of objects and folders/prefixes to find objects that were recently added or updated. Executing aws s3 ls on the entire bucket several times a day and then sorting through the list seems inefficient. Is there a way to simply request a list of objects with a modified time <, >, = a certain timestamp?
Also, are we charged once for the aws s3 ls request, or once for each of the objects returned by the request?
New to github, wish I knew enough to contribute actual code...appreciate the help.
The S3 API does not support this, so the only way to do this just using S3 is to do client side sorting.
As far as S3 pricing goes, we use a ListObjects request, which returns 1000 objects at a time. So you will be charged for one LIST request per 1000 objects when using aws s3 ls.
Another alternative is to store an auxiliary index outside of S3, e.g. in DynamoDB. Let me know if you have any other questions.
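To put the per-request billing described above in numbers, here is a minimal sketch. It assumes only what is stated in this comment, that each LIST request returns at most 1000 objects. The function name list_requests is made up for illustration, and actual prices are deliberately left out.

```shell
# Hypothetical helper: estimate how many LIST requests one full listing needs.
# Assumes each request returns at most 1000 keys (the page size quoted above).
list_requests() {
    objects=$1
    page_size=${2:-1000}
    # Integer ceiling division: a partially filled last page still costs a request.
    requests=$(( (objects + page_size - 1) / page_size ))
    # Even an empty listing takes one request.
    if [ "$requests" -lt 1 ]; then
        requests=1
    fi
    echo "$requests"
}

list_requests 100000   # prints 100
list_requests 100001   # prints 101
```

So listing a 100k-object bucket several times a day costs roughly 100 LIST requests per run, no matter how few objects actually changed.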
Thank you
Although this functionality appears to remain absent from aws-cli, it's pretty easy to script it in bash. For example:
#!/bin/bash
# Print only the listing entries dated today
DATE=$(date +%Y-%m-%d)
aws s3 ls s3://bucket.example.com/somefolder/ | grep "${DATE}"
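The same idea can be extended from "today" to "on or after a given date". The following is only a sketch: the function name modified_since and the sample listing lines are made up, and it assumes the usual aws s3 ls line layout of date, time, size and name. Like the grep above, it filters on the client, so the whole listing is still fetched.

```shell
# Hypothetical helper: keep only `aws s3 ls` lines dated on or after the cutoff
# (YYYY-MM-DD). ISO dates sort correctly when compared as plain strings.
modified_since() {
    awk -v cutoff="$1" '
        # Skip "PRE folder/" lines and anything else that has no leading date.
        $1 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ && ($1 "") >= (cutoff "")
    '
}

# Made-up sample listing, standing in for real `aws s3 ls` output.
printf '%s\n' \
    '2016-01-14 09:00:00         10 a.txt' \
    '                           PRE sub/' \
    '2016-01-15 10:00:00         20 b.txt' \
    '2016-01-17 11:00:00         30 c.txt' | modified_since 2016-01-15
```

With a real bucket the input would come from something like aws s3 ls s3://bucket.example.com/somefolder/ --recursive | modified_since 2016-01-15.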
@jwieder This doesn't help the user decrease the number of list calls to S3. Say that every day you store ~1000 news articles in a bucket. Then on the client side you want to get the articles for the last 3 days by default (and more only if explicitly requested). Having to fetch a list of all the articles since the beginning of time, say 100k, takes time and accrues network costs (because a single list call returns at most 1000 items). It would be much nicer to be able to say "Give me a list of items created/modified since 3 days ago".
Exactly!
@PuchatekwSzortach @ChrisSLT You're right, sorry for my lame reply; and I agree this sort of functionality would be very helpful in aws-cli. The combination of leaving this basic feature out and billing for file listings is highly suspect. Until AWS stops penny-pinching and introduces listing by file properties, here's another idea that I've used that is more relevant to this thread than my first reply: files that need to be tracked in this way are named with a timestamp, and a list of files is stored in a local text file (or it could be a db if you have gazillions of files to worry about). Searching for a date then involves opening that file and looking for filenames that match the date, which could look something like this:
# Download every indexed file whose name contains the date being searched for
while read -r fileName
do
    if [[ "$fileName" == *"$TODAY"* ]]; then
        aws s3 sync "$BUCKETURL" /some/local/directory --exclude "*" --include "$fileName"
    fi
done < "$FILE"
Where $FILE is your local filename index and $TODAY is the date you are searching for. You may need to adapt the condition on this loop to your naming scheme, but hopefully this can give you an idea.
Doing things this way relieves you of any charges related to listing the files in your bucket; but it also depends on the client you are conducting the search on having access to the local file list ... depending on your application / system architecture that might make this sort of approach unfeasible. Anyway, hope this helps and apologies again for my earlier derpy reply.
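For the index itself, one possible shape is a plain text file with one "date, tab, key" line per uploaded object. This is a sketch under assumptions: the file layout, the INDEX_FILE variable and the function names index_add and index_since are all hypothetical, and the index only stays accurate if every upload goes through index_add.

```shell
# Hypothetical local index: one "DATE<TAB>KEY" line per uploaded object.
INDEX_FILE=${INDEX_FILE:-./s3-index.tsv}

# Record a key; meant to be called right after a successful upload.
index_add() {
    printf '%s\t%s\n' "$1" "$2" >> "$INDEX_FILE"
}

# Print keys recorded on or after the given date, without any LIST request.
index_since() {
    awk -F '\t' -v cutoff="$1" '($1 "") >= (cutoff "") { print $2 }' "$INDEX_FILE"
}

# Demo against a throwaway index file with made-up keys.
INDEX_FILE=$(mktemp)
index_add 2016-01-14 articles/old.html
index_add 2016-01-18 articles/new.html
index_since 2016-01-16
rm -f "$INDEX_FILE"
```

In real use, index_add "$(date +%Y-%m-%d)" "$key" would follow each aws s3 cp, and the keys printed by index_since can be fetched directly by name.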
Agreed and thank you
There is a way to do this with the s3api and the --query option. This is tested on OS X:
aws s3api list-objects --bucket "bucket-name" --query 'Contents[?LastModified>=`2016-05-20`][].{Key: Key}'
You can then filter using jq or grep to do processing with the other s3api functions.
Edit: note that you have to use backticks to surround the date that you are querying.
Is it possible for you to create folders for each day? That way, you will be accessing only today's folder, or at most yesterday's as well, to get the latest files.
Yes, although you may find it easier to simply use a date prefix for your keys (you cannot query a bucketname/foldername combination using the --bucket option). Using the date prefix will allow you to use the --prefix flag in the CLI and speed up your queries, as AWS recommends using numbers or hashes at the beginning of key names for increased response times.
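A sketch of how date prefixes could be used to fetch only recent days. It assumes keys are stored as "YYYY-MM-DD/name", which is just one possible naming scheme, and the function name recent_prefixes is made up. Unlike --query, the prefix is sent to S3 with the list request, so only matching keys come back.

```shell
# Hypothetical helper: print the key prefixes for the last N days (UTC),
# newest first. Assumes keys look like "YYYY-MM-DD/<name>".
recent_prefixes() {
    days=$1
    now=$(date +%s)
    i=0
    while [ "$i" -lt "$days" ]; do
        ts=$((now - i * 86400))
        # GNU and BusyBox date accept -d @SECONDS; BSD/macOS date uses -r SECONDS.
        day=$(date -u -d "@$ts" +%Y-%m-%d 2>/dev/null) ||
            day=$(date -u -r "$ts" +%Y-%m-%d)
        printf '%s/\n' "$day"
        i=$((i + 1))
    done
}

recent_prefixes 3
```

Each prefix can then be listed on its own, for example for p in $(recent_prefixes 3); do aws s3api list-objects-v2 --bucket "bucket-name" --prefix "$p" --query 'Contents[].Key' --output text; done, which costs list requests only for the days you ask about.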
@willstruebing, your solution still does not reduce the number of S3 API calls, server-side query complexity, or amount of data sent over the wire. The --query parameter performs client-side JMESPath filtering only.
@kislyuk I agree completely that it does not answer the efficiency issues. However, my intention was to answer the specific question:
Is there a way to simply request a list of objects with a modified time <, >, = a certain timestamp?
That basic question is how I ended up on this thread, and so I thought it reasonable to include an answer to it. The issue is labeled "aws s3 ls - find files by modified date?".
I would love to hear anyone's ideas on the efficiency parts of the question, as I don't have one myself and am still curious.
for i in $(s3cmd ls | awk '{print $3}'); do aws s3 ls "$i" --recursive; done >> s3-full.out
What is the default order in which AWS returns files? Does it return them in alphabetical order, or by most recently modified, or what criterion does it use when you request your first batch of 1000 file names?
I agree that there certainly should be some kind of filter (sort by date, by name, etc.) that you can use when you request files... definitely a missing feature. :(
I agree this filtering should be server side and is a basic need.
+1 for server side querying/filtering
+1 for server side filtering
Still very needed indeed, +1
Agreed with @chescales and the rest, +1 to server side filtering
+1
+1
+1
+1
+1
+1
+1
+1
+1
+1
+1
+1
+1
How is this not a feature already?
+100000
+1e999
+1
+1
+1
+1
+1
+1
+1
+1
+65535
@willstruebing's comment worked for me, e.g.:
aws s3api list-objects --bucket "mybucket" --prefix "some/prefix" --query 'Contents[?LastModified>=`2018-08-22`].{Key: Key}'
Oh, never mind. I see after watching the network traffic from this command that all the keys are still being downloaded from S3 and the AWS CLI is doing the filtering client side!
+1
+1
+1
+1
what about the --exclude and --include filters?
DATE=$(date +%Y-%m-%d)
aws s3 ls s3://bucket.example.com/somefolder/ --exclude "" --include "${DATE}*"
+1
+1
+1 million
+1
+∞
+∞+1
+1
+1
+1
++
+1
+1
+1
+1 :( :(
I think that is part of the pricing model of AWS, super cheap storage but pay to access. Good for large files but will ruin you if you want to query/manage millions of small files.
+1
I guess this is why they created Athena? Another way to bill while adding some bells and whistles?
+1
+1
+1
I have to list the S3 bucket objects that were modified between two dates, e.g. 2019-06-08 to 2019-06-11.
Any ideas, anyone?
aws s3api list-objects --bucket "BUCKET" --prefix "OPTIONAL" --query "Contents[?LastModified>='2019-06-08'][].{Key: Key,LastModified: LastModified}"
and then use jq or your preferred tool to filter out everything after 2019-06-11.
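Both bounds can probably go into the --query expression itself, which saves the extra jq step. This is a sketch, not a tested recipe: the function name between_query is made up, and it relies on the CLI comparing the LastModified timestamp strings the way the queries earlier in this thread do. The filtering still happens on the client, so the number of list requests stays the same.

```shell
# Hypothetical helper: build a JMESPath filter for the half-open date range
# [start, end). LastModified is an ISO 8601 string, so dates compare as text.
between_query() {
    printf "Contents[?LastModified>='%s' && LastModified<'%s'].{Key: Key, LastModified: LastModified}\n" "$1" "$2"
}

# The end date is exclusive, so 2019-06-12 keeps everything modified on 2019-06-11.
between_query 2019-06-08 2019-06-12
```

The result would be passed as aws s3api list-objects-v2 --bucket "BUCKET" --prefix "OPTIONAL" --query "$(between_query 2019-06-08 2019-06-12)".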
That doesn't eliminate API calls. Those queries are client side.
@dmead I agree completely. However, the functionality to do server-side filtering does not currently exist (I think that's why so many people end up on this particular post), so this is the only workaround that I know of to complete the task at hand. Do you have a way to do it server side, or is this just an observation about the proposed solution? I'd love to hear input on how to do it AND reduce the number of API calls.
If you have the time, I'd look into selecting on metadata in Athena. I haven't had the chance myself, but that seemed like a possible solution.
+24
Everyone upvoting this, filing it with AWS CLI doesn't help. AWS CLI is bound by S3. File with the S3 team rather than a tool's github if you want it fixed :P
@mike-bailey OK, and how do I do that?
If it were me personally I'd file an AWS ticket so it gets to the service team. But I don't work for AWS. I just know commenting '+1' on this isn't going to be the change.
There is a way to do this with the s3api and the --query function. This is tested on OSX
aws s3api list-objects --bucket "bucket-name" --query 'Contents[?LastModified>=`2016-05-20`][].{Key: Key}'
You can then filter using jq or grep to do processing with the other s3api functions.
Edit: note that you have to use backticks to surround the date that you are querying.
Make sure you have the latest version of awscli before trying this answer. I upgraded awscli 1.11.47 -> 1.16.220 and it did the dreaded client-side filtering, but it worked.
+1 for server-side filtering.
+1
+1
Please read the thread, +1 doesn’t do anything
You can't do this easily but buried in these comments is the following tip:
aws s3api list-objects --bucket "bucket-name" --query 'Contents[?LastModified>=`2016-05-20`][].{Key: Key}'
This is still client side and will perform plenty of requests.
As noted prior though, it handles it client side. So you still potentially slam the bucket with calls.
Filtering should be server side and is a basic need I think.
Here is an example using aws s3 sync so only new files are downloaded. It combines the logs into one log file and strips the comments before saving the file. You can then use grep and similar tools to get log data. In my case, I needed to count unique hits to a specific file. The code below was adapted from this link: https://shapeshed.com/aws-cloudfront-log/ The sed command works on Mac as well and is different than what is in the article. Hope this helps!
aws s3 sync s3://<YOUR_BUCKET> .
cat *.gz > combined.log.gz
gzip -d combined.log.gz
sed -i '' '/^#/ d' combined.log
# counts unique logs for px.gif hits
grep '/px.gif' combined.log | cut -f 1,8 | sort | uniq -c | sort -n -r
# above command will return something like below. The total count followed by the date and the file name.
17 2020-01-02 /px.gif
9 2020-01-03 /px.gif
I know it's an old issue, but to leave an elegant solution here:
aws s3api list-objects --bucket "<YOUR_BUCKET>" --output=text --query 'Contents[?LastModified >= `<DATE_YOU_WANT_TO_START>`].{Key: Key}'