aws-cli: s3 sync not syncing files locally that have the same size but have been modified

Created on 30 Dec 2014 · 16 comments · Source: aws/aws-cli

I originally posted this as a reply to #406, but I think it's worth posting as a new issue, as it seems like a relatively straightforward but significant bug.

When doing an s3 sync from S3 to local, files that are newer (based on modified time) on S3 won't sync to local if the file size is the same.

It seems like it's just because the compare_time() function is wrong.

In aws-cli/awscli/customizations/s3/syncstrategy/base.py, lines 207 - 223 (as of 9f56e8f):

    if cmd == "upload" or cmd == "copy":
        if self.total_seconds(delta) >= 0:
            # Destination is newer than source.
            return True
        else:
            # Destination is older than source, so
            # we have a more recently updated file
            # at the source location.
            return False
    elif cmd == "download":

        if self.total_seconds(delta) <= 0:
            # Destination is older than or the same
            # age as the source; do not download.
            return True
        else:
            # delta is positive, so the destination
            # is newer than the source.
            return False

There shouldn't be any difference in logic between the cmd == "upload" / cmd == "copy" case and the cmd == "download" case. It's comparing source and destination file times, so all it needs to know is whether the source file is newer than the destination or not.

i.e. change the above block of code to just:

        if self.total_seconds(delta) >= 0:
            # Destination is newer than source.
            return True
        else:
            # Destination is older than source, so
            # we have a more recently updated file
            # at the source location.
            return False

Seems to fix the problem.

The same comparison logic was also an issue in earlier versions, before the sync strategies were split out, when the code lived in comparator.py.

feature-request s3sync

All 16 comments

@Bazman

The main reason we have that logic is so that the sync can be round-tripped. For example, given the CLI's current logic, let's begin with a file test that has a last modified time of Dec 30 10:57:36 2014 and an empty bucket s3://mybucketfoo. If we do a sync, it will appropriately upload the file:

$ aws s3 sync . s3://mybucketfoo
upload: ./test to s3://mybucketfoo/test

Then when we check the bucket, the file's last modified time will be newer than that of the local file:

$ aws s3 ls s3://mybucketfoo
2014-12-30 11:09:49         15 test

So if we try to resync, it appropriately does not re-upload the file:

$ aws s3 sync . s3://mybucketfoo

Now if we try to sync via downloading, it will also not download the file:

$ aws s3 sync s3://mybucketfoo .

If for downloads we use the same logic as for uploads, the sync will always download the file, because the last modified time of the S3 file (which is the source for this sync) will be newer than that of the local file (which is the destination for this sync), even though neither file has changed. If we were able to explicitly set the last modified time of the S3 object, we could get your proposed logic to work.
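
To make the arithmetic concrete, here is a minimal sketch of that sign check (the should_skip helper and the timestamps are hypothetical; only the delta comparison mirrors the compare_time() logic quoted above):

    # Minimal sketch of the delta arithmetic behind the round-trip argument.
    # The timestamps and the should_skip helper are hypothetical; only the
    # sign check mirrors compare_time() quoted above.
    from datetime import datetime

    local_mtime = datetime(2014, 12, 30, 10, 57, 36)  # local file, written first
    s3_mtime = datetime(2014, 12, 30, 11, 9, 49)      # S3 copy, stamped at upload time

    def should_skip(src, dest):
        # delta = destination - source, as in syncstrategy/base.py
        delta = dest - src
        # Destination newer than or equal to source -> treat as in sync.
        return delta.total_seconds() >= 0

    # Upload direction: source = local, destination = S3 -> skipped, as expected.
    print(should_skip(local_mtime, s3_mtime))  # True

    # Download direction with the same rule: source = S3, destination = local.
    # The local copy always looks older than the S3 timestamp, so the file
    # would be re-downloaded on every sync even though nothing changed.
    print(should_skip(s3_mtime, local_mtime))  # False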

Given the round-trip reasoning, we are going to keep the default sync strategy logic as is. If you do not mind my asking, what was the use case for the sync strategy logic that you were proposing? We could possibly add it as an additional feature/parameter.

I thought there must be a reason, as it seemed too obvious, and I understand now why it's related to #599.

Our use case is we sync some large directories from a master EC2 server to S3. Then there is a cloud of EC2 servers that regularly sync down from S3. With the current strategy, this basically doesn't work, as files that have changed but are the same size don't sync down to the cloud.

Thanks! This will have to be a feature request. We will look into/consider adding such a sync strategy.

@kyleknap, we found that using --exact-timestamps solves our problem. Not sure if it's meant to be the solution to this or not, but it seems to work perfectly.

Thanks.

That's right. I forgot about that option. Good to hear that it worked out for you. Closing issue.

I have the same use case as Bazman - we sync up to S3, and then sync from S3 down to a different machine. I don't think --exact-timestamps solves it.

In reading the documentation I'm concerned about the AWS sync logic - are you saying that the default behavior when syncing down from S3 is that an older S3 version will replace the newer local version?
"--exact-timestamps (boolean) When syncing from S3 to local, same-sized items will be ignored only when the timestamps match exactly. The default behavior is to ignore same-sized items unless the local version is newer than the S3 version."

I think I see what you're getting at, but by and large it seems confusing to have sync logic in which an older file replaces a newer file.

Maybe we could look at how rsync handles this kind of situation?

Thank you

Thinking about this more, I see how it could be useful to have s3>local overwrite newer files on the local end - for example, if a user changes something locally and you want the S3 content to refresh/overwrite the user's content.

However, I still hope that there can be an option to do the above, as well as an option to make local>s3 and s3>local behave the same way (wherein a file goes from source to destination only if it's newer at the source, or doesn't exist at the destination, etc.).

Thank you!

This is a really nasty bug and led to an error in production that took me days to track down. Basic tools like this should just work without users having to know implementation details. Why doesn't sync use something sane like a checksum instead of a timestamp or file size?

It is 2018 and this is still a bug. In short, if you modify a file locally and then run aws s3 sync $myBucket $localDirectory, the older file in S3 will overwrite the newer file regardless of whether or not the sizes match. To test, run:

aws s3 sync $myBucket $localDirectory
touch -am $localDirectory/$localFile
aws s3 sync $myBucket $localDirectory

The file that was modified after the bucket file will then be overwritten, and the modified date on the local file will now match the older S3 object's modified date. This is contrary to AWS's documentation for the sync command: https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html

Just to make the use case concrete, we had a diff in a properties file of:

@@ -1,4 +1,4 @@
-server.ssl.client-auth=need
+server.ssl.client-auth=want

This change left the properties file the same size, so the sync from the S3 bucket to the local machine failed to pick up the change - obviously very different behavior from what was expected.

I'm being affected by this due to different timezones between the uploader and downloader.

It seems that the explanation given in the first answer might be obsolete: when I sync a file from S3 to local, the local file then has the same last-modified time as shown on S3. I need a strategy opposite to the default one: when syncing from S3 to local, if S3 is newer, update the local file.

@kyleknap, would you mind reopening this issue or pointing me to a more recent duplicate, if any?
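
In case it helps anyone else hitting this, a rough sketch of the opposite strategy I have in mind follows, assuming the internal BaseSync interface from awscli/customizations/s3/syncstrategy/base.py quoted at the top of this issue (an unsupported internal API that may change between CLI versions; the class name is made up, and hooking a strategy up to the CLI requires registering it, which is not shown):

    # Rough sketch only: assumes the internal BaseSync class and the
    # last_update attribute referenced in syncstrategy/base.py; this is an
    # unsupported API and may differ between CLI versions.
    from awscli.customizations.s3.syncstrategy.base import BaseSync

    class NewerSourceSync(BaseSync):
        # Hypothetical strategy: treat the pair as in sync only when the
        # destination is at least as new as the source, for downloads as
        # well as uploads (the change proposed at the top of this issue).
        def compare_time(self, src_file, dest_file):
            delta = dest_file.last_update - src_file.last_update
            return self.total_seconds(delta) >= 0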

We are still experiencing this problem: a local file index.html gets updated with only a hash changed, which makes the file stay the same size. The sync command does not upload the file.

Any update from the code maintainers?

I have the same problem - I need to sync data from a local machine to S3. Please reopen the ticket.

I have the same problem.
At the very least, awscli should make sure that the documentation is consistent with the code.
