It's very easy for people to miss that `--exact-timestamps` is necessary for `aws s3 sync` to produce correct results, especially since the default settings are the opposite of what common tools like rsync use. `--exact-timestamps` should be enabled by default so that `sync` matches both common English usage and the description in the man page, which is currently incorrect:

> Recursively copies new and updated files from the source directory to the destination.

If someone has an odd case where the current behaviour is an optimization, it could be offered as `--no-exact-timestamps` or, better, as `--size-only`, following rsync's lead to make the implications more obvious.
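For concreteness, the stricter comparison has to be requested explicitly today; a minimal invocation (the bucket name and destination are placeholders):

```sh
# Explicitly re-check same-sized files whose timestamps differ when
# syncing from S3 to local (bucket name is a placeholder):
aws s3 sync s3://mybucket . --exact-timestamps
```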
A newer S3 timestamp does not necessarily mean a newer version of the file, since S3 does not allow setting the timestamp the way a filesystem does. While comparing exact timestamps for same-sized files would not be wrong, it would very often be slow and expensive, since you would always be downloading.
Right now we follow a pattern of spending as few customer resources as possible. In general that is a great philosophy, but for `sync` in particular we could provide a better experience by being more liberal with requests. For example, we could actively set our own timestamps and/or checksums in the object metadata and then compare those to determine whether a sync should happen. That would still be slow and expensive, but depending on the use case, possibly less so than downloading every time.
In any case, we won't be able to change the default without a major version bump. We'll keep this frustration in mind when we get to that time. Thanks for the feedback!
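As a rough sketch of that metadata idea (not something the CLI does today; bucket, key, and file names are placeholders, and GNU `stat` is assumed), the original mtime could be recorded as user metadata at upload time and later compared with a cheap HEAD request instead of trusting S3's `LastModified`:

```sh
# Record the source file's mtime as user metadata when uploading.
mtime=$(stat -c %Y local/file.txt)            # GNU stat; seconds since epoch
aws s3api put-object \
  --bucket mybucket --key file.txt --body local/file.txt \
  --metadata source-mtime="$mtime"

# Later, decide whether a download is needed by comparing the recorded
# mtime (a HEAD request) rather than downloading unconditionally.
remote_mtime=$(aws s3api head-object --bucket mybucket --key file.txt \
                 --query 'Metadata."source-mtime"' --output text)
if [ "$remote_mtime" -gt "$(stat -c %Y local/file.txt)" ]; then
  echo "remote copy is newer; download it"
fi
```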
Correcting the documentation would also help. It specifically mentions handling updated files in the description without noting the caveats.
What about calculating a hash from both the local and S3 files instead of (or in addition to) comparing the file size, and, if the values differ, copying the newer file over the older one?
@froblesmartin This is complicated by the way S3 generates ETags for multipart uploads, so you either need to implement the same hash on the local side (which does at least save recalculating the hash for the S3 object) or add a custom header, which has the advantage of not tying you to S3's implementation and could be compatible with existing hashes if you happen to have them. The systems I've designed typically use that approach, so we can attach something like `X-Original-SHA512: …` and compare it against local manifests.
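A minimal sketch of that approach with the stock CLI, assuming the hash is stored as S3 user metadata (the `original-sha512` key is illustrative; S3 returns it as `x-amz-meta-original-sha512`, i.e. `Metadata["original-sha512"]` in the API response):

```sh
# Store the file's SHA-512 as user metadata at upload time.
aws s3api put-object \
  --bucket mybucket --key file.txt --body local/file.txt \
  --metadata original-sha512="$(sha512sum local/file.txt | awk '{print $1}')"

# Compare the stored hash against the local file without downloading the object.
remote_hash=$(aws s3api head-object --bucket mybucket --key file.txt \
                --query 'Metadata."original-sha512"' --output text)
local_hash=$(sha512sum local/file.txt | awk '{print $1}')
[ "$remote_hash" = "$local_hash" ] && echo "in sync" || echo "copy needed"
```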
This is a huge gotcha for anyone familiar with rsync's behavior. S3's sync and copy functions behave differently enough from standard *nix tools that I have to write wrappers around them to correct their behavior. I don't see any way of changing this now, as the tools are established. When we someday abandon S3 and move on to the next better thing, please maintain uniformity with similar and established tools.
> A s3 object will require downloading if the size of the s3 object differs from the size of the local file, the last modified time of the s3 object is newer than the last modified time of the local file, or the s3 object does not exist in the local directory.

`aws s3 sync s3://mybucket .`

(source: aws sync docs)
As far as I understand, "or" means that any one of the conditions mentioned has to be true for `sync` to require downloading. Am I reading this wrong?
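Taken literally, the "or" does mean any single condition is enough to trigger a download. As an illustration of that reading only (not the CLI's actual code path; bucket, key, and paths are placeholders, and GNU `stat`/`date` are assumed):

```sh
# Illustration of the quoted rule only: download when ANY condition holds:
#   1. the sizes differ
#   2. the S3 object's LastModified is newer than the local file's mtime
#   3. the file does not exist locally
bucket=mybucket; key=file.txt; local_path=./file.txt   # placeholders

needs_download() {
  [ -f "$local_path" ] || return 0                                    # missing locally
  remote_size=$(aws s3api head-object --bucket "$bucket" --key "$key" \
                  --query ContentLength --output text)
  remote_mtime=$(aws s3api head-object --bucket "$bucket" --key "$key" \
                  --query LastModified --output text)
  [ "$remote_size" != "$(stat -c %s "$local_path")" ] && return 0     # sizes differ
  [ "$(date -d "$remote_mtime" +%s)" -gt "$(stat -c %Y "$local_path")" ] && return 0  # S3 newer
  return 1                                                            # otherwise skip
}

needs_download && echo "would download" || echo "would skip"
```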
@JohnLunzer would you be open to open sourcing your wrapper? I (and I suspect the community) would be very interested in your enhancements!