There have been a few issues with respect to the sync command, particularly in the case of syncing down from S3 (s3 -> local). I'd like to summarize the known issues as well as a few proposals of possible options, and give people the opportunity to share any feedback they might have.
The sync behavior is intended to be an efficient cp: only copy over the files from the source to the destination that are different. In order to do that we need to be able to determine whether or not a file in s3/local is different. To do this, we use two values:

- File size (from stat'ing the file locally, and from the Size key in a ListObjects response)
- Last modified time (from stat'ing the file locally, and from the LastModified key in a ListObjects response)

As an aside, we use the ListObjects operation because we get up to 1000 objects returned in a single call. This means that we're limited to the information that comes back in a ListObjects response, which is LastModified, ETag, StorageClass, Key, Owner, and Size.
Now, given the remote and local files' sizes and last modified times, we try to determine if the file is different. The file size check is easy: if the sizes differ, we know the files are different and we need to sync. Last modified time, however, is more interesting. While the mtime of the local file is a true mtime, the LastModified time from ListObjects is really the time the object was uploaded. So imagine this scenario:
aws s3 sync local/ s3://bucket/
sleep 10
aws s3 sync s3://bucket local/
After the first sync command (local -> s3), the local files will have an mtime of 0, and the contents in s3 will have a LastModified time of 10 (using relative offsets). When we run the second aws s3 sync command, syncing from s3 to local, we first do the file size check. In this case the file sizes are the same, so we fall through to the last modified time check. Here the times differ (local == 0, s3 == 10). If we were doing a strict equality comparison then, because the last modified times are different, we would unnecessarily sync the files from s3 to local. So we can say: if the file sizes are the same and the last modified time in s3 is greater (newer) than the local file's, then we don't sync. This is the current behavior.
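The rule above can be sketched as a small decision function. This is a simplified model of the behavior described in this thread, not the actual aws-cli implementation; the function and parameter names are illustrative:

```python
def should_sync_down(local_size, local_mtime, s3_size, s3_last_modified):
    """Decide whether to download an S3 object over an existing local file.

    Sizes are in bytes; times are epoch seconds. Simplified model of the
    current s3 -> local sync rule, not the real aws-cli code.
    """
    if local_size != s3_size:
        # Different sizes: the contents definitely differ, so sync.
        return True
    if s3_last_modified > local_mtime:
        # The s3 copy looks "newer" only because LastModified is the
        # upload time, so treat it as unchanged. (This is exactly the
        # gap with out-of-band, same-size updates.)
        return False
    # Otherwise, sync only if the timestamps actually differ.
    return s3_last_modified != local_mtime
```

With the scenario above (same size, local mtime 0, s3 LastModified 10), this returns False, so nothing is re-downloaded.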
However, this creates a problem if the remote file is updated out of band (via the console or some other SDK) and the size remains the same. If we run aws s3 sync s3://bucket local/ we will not sync the remote file even though we're supposed to.
Below are potential solutions.
aws s3 sync local s3://bucket && aws s3 sync s3://bucket local will unnecessarily sync files. However, when we download a file we set the mtime of the file to match the LastModified time, so if you were to run aws s3 sync s3://bucket local _again_, it would not sync any files.

If there are any other potential solutions I've left out, please chime in.
Rather than store cache of server-provided ETags, can the ETag be calculated on the client side? Then it would be almost like doing md5 checks, but using data available in the ListObjects response. As long as the ETag algorithm doesn't rely on server-side state…
You found it :-)
-- Seb
On 17 Jan 2014, at 16:26, Jeff Waugh <notifications@github.com> wrote:
Ah: https://forums.aws.amazon.com/thread.jspa?messageID=203510&state=hashArgs%23203510
Yep, we can't reliably calculate the ETag for multipart uploads, otherwise that would be a great solution.
Could you add a new flag (or two) for the time behaviour? Perhaps --check-timestamps for option 1 and --update-local-timestamps for option 2. That way the user can specify a more robust check for changes and accept the consequences at the same time.
Yeah, I think adding flags for options 1 and 2 would be a reasonable approach. One potential concern is that the default (no options specified) behavior has cases where sync doesn't behave how one would expect, but I'm not sure changing the default behavior to either of these options is a good thing here, given the potential tradeoffs we'd be making.
@jamesls I'm using the sync command to deploy a generated static website.
With the current version, I'm re-uploading all files every sync because the mtime changes when the site is regenerated, even if the content does not.
For my purposes (and I imagine a healthy number of other folks using this fabulous tool to upload their static sites) syncing via ETag as suggested in #575 would be most awesome, but given my reading of that issue it doesn't seem to be an option.
Barring that, for the purposes of static sites, a length only check (though maybe slightly dangerous) would work.
Another option would be for us to disable multi-part uploads and use #575 - we'd see huge savings immediately.
I have found the reverse problem. I changed a file in S3 that has the same size but a newer timestamp, and s3 sync doesn't pull it down.
aws s3 sync s3://bucket/path/ dir
Looking at the data in S3, I think it's because of timezone issues.
The Properties show a time of
Last Modified: 2/21/2014 10:50:33 AM
But the HTTP headers show
Last-Modified: Fri, 21 Feb 2014 15:50:33 GMT
Note that the Last Modified property doesn't show the timezone?
Since my s3 sync command is running on a server in a different timezone from where I put the file, it thinks the file is in the past and doesn't pull it down.
I had to switch to s3 cp to make sure it gets all the files.
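The confusion above is the console rendering a zone-less local time while the HTTP header is always GMT. One way to sidestep it is to never compare wall-clock strings and instead convert both sides to epoch seconds; a sketch using the standard library (illustrative helper, not aws-cli code):

```python
from email.utils import parsedate_to_datetime

def header_to_epoch(last_modified_header):
    """Convert an HTTP Last-Modified header to epoch seconds (UTC).

    The header carries an explicit zone ("GMT" or a numeric offset), so
    the resulting epoch value is safe to compare against a local file's
    st_mtime regardless of the machine's configured timezone.
    """
    return parsedate_to_datetime(last_modified_header).timestamp()

# The two renderings from the example above are the same instant:
# "2/21/2014 10:50:33 AM" in US Eastern (-0500) == 15:50:33 GMT.
utc = header_to_epoch("Fri, 21 Feb 2014 15:50:33 GMT")
eastern = header_to_epoch("Fri, 21 Feb 2014 10:50:33 -0500")
```

Both calls yield the same epoch value, which is the comparison a timezone-safe sync would use.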
I think as a first step, we should implement the --size-only argument. It doesn't solve the problem in the general case, but for certain scenarios it will help, and it's easy to understand/explain, particularly for the use case referenced above of static sites being synced to s3.
I think sync should have an option to always sync files if the file to sync is newer than the target. We are syncing files from machine A to S3 and afterwards from S3 to machine B. If the size of a file does not change (but the content does), this file will never reach machine B. This behavior is broken. I do not care if I sync too many files, but changed files should never be left out.
As per my previous post, "newer" needs to take the timezone into account as well.
Currently it does not, so if you push a file to S3 from one timezone and then sync from another, it won't correctly detect that the file is newer.
@jamesls Further to the --size-only argument, I would be interested in using a --name-only argument. That is, don't check either the File Size or Last Modification Time. Simply copy files that exist in the source but not in the target. In our scenario we sync from s3 to local and once we have downloaded a file we don't expect it to ever change on s3. If this option resulted in fewer operations against our local (nfs) filesystem it could yield a performance improvement.
@jamesls Should --size-only et al. be available in 1.3.6?
My AWS Support rep for Case 186692581 says he forwarded my suggestion following to you.
I thought I would post it here anyway for comment:
I think a simple solution would be to introduce a fuzz factor.
If it normally wouldn't take more than 5 minutes for the local -> S3 copy,
then use a 10 minute fuzz factor on subsequent time comparisons.
Treat relative times within 10 minutes as equal.
If the S3 time is more than 10 minutes newer then sync from S3 -> local.
Perhaps add "--fuzz=10m" as an option.
@jamesls @adamsb6
Wouldn't https://github.com/aws/aws-cli/pull/575 be a good option, at least for single-part uploaded files?
If you check the ETag format of the file on S3, you can tell whether it was uploaded as a single part (ETag = "MD5 hash") or as multipart (ETag = "MD5 hash"-"number of parts"). So you could compare all local files' MD5s to their ETags, and in the case of a file that was uploaded as multipart, skip it.
We've got a customer that has lots of video clips in certain folders on an S3 Bucket, that are synced to ec2 instances in all AWS Regions. All files are uploaded as single part.
At the moment we have a problem, caused by s3cmd, where some files on some instances are corrupted. If we did a full sync again, we'd be charged for 14 TB of traffic.
Our problem: the corrupted files have exactly the same size as the originals on s3, and due to wrong timestamps from s3cmd we can't use the options mentioned above. In this case --compare-on-etag would be a great solution to avoid syncing all files again.
Even for normal syncing the --compare-on-etag option would be great, if you only have single-part uploaded files, because aws s3 sync would then sync only changed files.
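The single-part vs. multipart distinction suggested above is easy to act on, because a multipart ETag contains a "-" followed by the part count. A sketch of the proposed --compare-on-etag check (hypothetical flag; the helper name is illustrative):

```python
import hashlib

def etag_matches(local_path, s3_etag):
    """Compare a local file's MD5 to an S3 ETag, where that is possible.

    Sketch of the proposed --compare-on-etag behavior (hypothetical flag,
    not an actual aws-cli option). For single-part uploads the ETag is the
    object's MD5 hex digest; multipart ETags look like "<hash>-<parts>"
    and are NOT the object's MD5, so we return None to mean "unknown".
    """
    etag = s3_etag.strip('"')
    if "-" in etag:
        return None  # multipart upload: cannot verify via MD5
    with open(local_path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest() == etag
```

A sync implementation would fall back to the size/timestamp heuristics whenever this returns None, which is exactly the "skip multipart files" behavior described in the comment.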
I've just spent the better part of 3 hours attempting to find the minimum permissions required to use the sync command. The error I was getting was:
A client error (AccessDenied) occurred when calling the ListObjects operation: Access Denied
When really the error should have been:
A client error (AccessDenied) occurred when calling the ListBucket operation: Access Denied
A help item which shows a table with the minimum permissions for each command would be _very_ helpful.
Edit: To clarify, add rsync like behaviour to "aws s3 sync". It seems that that issue as reported is not quite what I initially understood it to be.
Since the latest AWS-CLI-bundle.zip does not contain the fix implemented above, I did a git clone. I can see the new code in a folder called "customizations". However, it is not clear to me how to create an aws-cli command using this code. Do I have to run make-bundle?
Yep. I use the following steps to install it onto new servers (Ubuntu):
git clone https://github.com/andrew512/aws-cli.git
cd aws-cli
pip install -r requirements.txt
sudo python setup.py install
OK.
I see the modified code in version 1.3.18.
It accepts my --exact-timestamps parameter.
I thought the latest download bundle I had previously installed was 1.3.21.
Reliable versioning will really only apply to the official AWS releases. I forked the repo at 1.3.18 so that's the version it will report, but it's already a few versions out of date, with 1.3.22 being the most recent as of right now. Hopefully AWS accepts the pull request and includes the feature in future official releases. It's been very valuable to us and helps address a pretty questionable default behavior.
@andrew512 Sorry for the delay. I think the PR you've sent is a good idea, and it's really helpful to have customer feedback regarding what aws s3 sync changes work for them and what don't. I'll take a look shortly.
I think, for those of us who don't mind the header requests, comparison on MD5 should be an option. I would (secondly) vote for --compare-on-etag, because I update only from one server to S3, so a local MD5 repo is not an issue for me. But I definitely think we need to have something. As it is, I am NEVER sure my local and S3 repos are the same. Where are we on the status of something like this?
@janngobble +1
Our use case is we have these files in a git repo and they are configuration files so neither date modified nor file size really work so we'd like to see an actual md5 option for those that can handle the performance implications.
This is because when you check out a git repo the file modified date is when the file is written. Also file size does not work because the file change may be something like :
foo="bar"
to
foo="baz"
so the file would not change size.
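The same-size, different-content case above is easy to demonstrate: a size-only comparison is blind to the edit, while a checksum catches it.

```python
import hashlib

before = b'foo="bar"'
after = b'foo="baz"'

# Same length, so a size-only comparison reports "unchanged"...
assert len(before) == len(after)

# ...but the MD5 digests differ, so a checksum comparison would sync it.
assert hashlib.md5(before).hexdigest() != hashlib.md5(after).hexdigest()
```

This is the core argument for offering an MD5/ETag comparison mode alongside --size-only.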
@jamesls Why can't you use the method here to calculate the md5 for multipart uploads? It worked for me.
Hello,
I also have this issue well described with foo="bar"/foo="baz".
I use an S3 bucket for my application deployment, and all servers sync from S3 when a deploy is done. A few times I've had the problem of an operator >= changed to <= in a file that wasn't synced due to this bug, so for me the sync command is not very reliable. The size is the same but the content is different; the file should be synced.
I have no particular advice about how to do it, sorry for that, but I'm just exposing my usecase :)
Go figure, I came across the same issue while developing node-ftpsync. I assumed AWS would have some magical solution to solve this.
It's probably a good idea to 'fuzz' it (ie rounding to the nearest 10 minutes) like @ngbranitsky suggested. Doing so in node.js is a pain but in python it should be as easy as truncating the last few bits using a bitwise AND.
Since AWS doesn't have that issue you also have to consider how mtime is used on the local host. By changing the mtime value on every sync are you going to trigger a mass update event if there are transpilers watching those files? Are there other metadata stores that use mtime as a metric for measuring file changes?
I realize this is not a general solution, but it would be very nice to see the following optional behavior for syncing, imo. It's sort of like no. 4 in the OP, only modified for better performance.
The use case here is syncing web map tiles, of which <1% normally change on a daily basis. The exception is when deletes occur in the source data that change the spatial coverage of the tiles, which necessitates the entire set of map tiles be regenerated.
The issue is the sheer volume involved. I see discussion about large multipart uploads as a use case, but not about many individual small files. How many? We have ~2m currently, but it can get much worse. For example, at zoom level 16, the world has 1 << 16 x 1 << 16, or 65536 x 65536 tiles, or ~4b.
Current options are:
I could write C code without much difficulty that would walk a directory path, update a local sqlite3 cache DB, and build the potentially changed set / upload queue. Unfortunately, I have no python experience and cannot submit this as a pull request for an optional sync behavior.
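The walk-and-cache idea in that comment can be sketched in a few lines of Python. This is a hypothetical helper, not part of aws-cli, and it assumes the commenter's constraint holds: files only change locally, never out of band on the server.

```python
import os
import sqlite3

def changed_files(root, db_path):
    """Walk `root` and yield paths whose (size, mtime) differ from the
    values in a local sqlite3 cache, updating the cache as we go.

    Minimal sketch of a "build the upload queue locally" strategy
    (hypothetical helper). The paths it yields are the candidate set to
    hand to the uploader, avoiding any per-object server round trips.
    """
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS cache "
               "(path TEXT PRIMARY KEY, size INTEGER, mtime REAL)")
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            row = db.execute("SELECT size, mtime FROM cache WHERE path=?",
                             (path,)).fetchone()
            if row != (st.st_size, st.st_mtime):
                db.execute("INSERT OR REPLACE INTO cache VALUES (?,?,?)",
                           (path, st.st_size, st.st_mtime))
                yield path
    db.commit()
    db.close()
```

On the ~2m-tile workload described above this does one stat and one indexed lookup per file, all local, which is the performance point being made.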
I don't know how common the "many small files, guaranteed no out-of-band changes, local->server only" use case is.
I find it unintuitive and dangerous that sync treats a difference in last modified times as a reason to update, even if a newer file at source is replacing an older at the destination. This odd behavior should be documented.
Also related to #404.
What about adding a single, self-documenting --sync-if flag instead of the growing number of non-self-documenting options? (e.g. --size-only, --exact-timestamps. Can I apply these two at the same time? Why do I have to read the documentation or try them to figure this out?)
The --sync-if flag could take a list of options:
--sync-if newer-timestamp,different-md5,different-timestamp,different-size,deleted,...
The user could specify one or more (as a comma-separated list), and if the file meets any of the criteria, it would be updated (uploaded/deleted) at the destination.
This would clarify the behaviour greatly, especially if the documentation mentioned that the default behaviour is --sync-if different-timestamp,different-size.
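A sketch of how such a flag could be evaluated (the --sync-if flag and criterion names are the proposal above, entirely hypothetical; the dict-based file metadata is an illustrative simplification):

```python
def needs_sync(criteria, local, remote):
    """Return True if any selected criterion flags the pair as different.

    `criteria` is the comma-separated value of the proposed --sync-if
    flag (hypothetical). `local` and `remote` are dicts with 'size',
    'mtime' (epoch seconds), and 'md5' keys.
    """
    checks = {
        "different-size": lambda: local["size"] != remote["size"],
        "different-timestamp": lambda: local["mtime"] != remote["mtime"],
        "newer-timestamp": lambda: remote["mtime"] > local["mtime"],
        "different-md5": lambda: local["md5"] != remote["md5"],
    }
    # Any matching criterion is sufficient to trigger a sync.
    return any(checks[name]() for name in criteria.split(","))
```

The OR-composition is what makes the flag self-documenting: the command line states exactly which differences will cause a transfer.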
Reading through this issue, I can't figure out if this sync behaviour has been fixed yet.
I want something that works as simply as rsync -avz to sync my local build with the files on the server.
I had been using aws s3 sync, but because I have one file which is large (a help file that is a movie) and my build step creates all files from new, it copied all files every time and was needlessly slow to update the site.
I then started using --size-only to speed it up. Unfortunately, this bit me in the butt recently. I had renamed a file, and the build step includes a list of files in the service-worker.js file so the sw knows what to cache. Unfortunately, the new and old file names were the same length, so it didn't update service-worker.js and kept giving a 404 for the old filename. It took me quite a while to figure out what was going on.
It does seem like this is a solved problem in other environments - i.e. by rsync - though, tbh, I'm probably somewhat ignorant of the challenges of doing something like this on S3. Anyway, having been bitten by this recently, I'm looking at other clients, but they seem to have dependencies that I'm not interested in adopting just for this functionality.
Would be great to have an etag sync option. I know that there are scenarios where it fails, but for me it would be super valuable.
Good Morning!
We're closing this issue here on GitHub, as part of our migration to UserVoice for feature requests involving the AWS CLI.
This will let us get the most important features to you, by making it easier to search for and show support for the features you care the most about, without diluting the conversation with bug reports.
As a quick UserVoice primer (if not already familiar): after an idea is posted, people can vote on the ideas, and the product team will be responding directly to the most popular suggestions.
We’ve imported existing feature requests from GitHub - Search for this issue there!
And don't worry, this issue will still exist on GitHub for posterity's sake. As it’s a text-only import of the original post into UserVoice, we’ll still be keeping in mind the comments and discussion that already exist here on the GitHub issue.
GitHub will remain the channel for reporting bugs.
Once again, this issue can now be found by searching for the title on: https://aws.uservoice.com/forums/598381-aws-command-line-interface
-The AWS SDKs & Tools Team
This entry can specifically be found on UserVoice at : https://aws.uservoice.com/forums/598381-aws-command-line-interface/suggestions/33168808-aws-s3-sync-issues
Moving away from github issues? That seems like a mistake...
Agreed. It seems more like the method Microsoft use for judging the importance/impact of issues, but I find it quite irritating.
Based on community feedback, we have decided to return feature requests to GitHub issues.
Bumping this back up!
To the top!
Comparison of md5 would be great. I'd also add that it would be helpful to output the md5 on upload or download; this could be stored in our own db and would help us determine through our database whether a sync is needed, limiting requests.
@jamesls Could you comment this issue please?
https://github.com/aws/aws-cli/issues/4460