Aws-cli: Corrupted downloads when a file on S3 changes mid-download

Created on 13 Dec 2016 · 10 comments · Source: aws/aws-cli

We're observing behavior where aws-cli downloads a corrupt file from S3 if the file is replaced mid-download. We're thinking this happens because of multipart downloads -- each part fetched is consistent with some version of the file, but some of the parts are coming from different versions of the file.

The end result is that aws-cli exits zero but leaves a corrupted file that produces errors further along in our processing.

Reproduction

This reproduces the issue almost 100% of the time:

  1. Make a virtualenv and install the latest aws-cli:
    virtualenv -ppython2.7 venv && venv/bin/pip install awscli
  2. Make eight files f0 through f7, each consisting of 1GB of a single repeated byte:
    for i in {0..7}; do dd if=/dev/zero bs=1M count=1000 | tr '\000' "\00${i}" > "f${i}"; done
  3. Upload these files to the same key, staggered so that some uploads finish while others are still in progress, then start a download of that key using aws-cli. Here's one example script:

    #!/usr/bin/time bash
    set -euo pipefail
    n="$RANDOM"
    key="s3://my-bucket/test-${n}"
    results="result-${n}"
    
    # stagger the uploads a bit, start them all in the background
    for f in f*; do
        venv/bin/aws s3 cp "$f" "$key" &
        sleep 5
    done
    
    # wait for the first three to finish
    wait %1
    wait %2
    wait %3
    
    venv/bin/aws s3 cp "$key" "$results"
    my_hash=$(openssl sha1 "$results" | cut -d' ' -f2)
    echo "my hash is: $my_hash"
    
  4. Compare the hash of the downloaded file with the hashes of the f files; it usually matches none of them:

    $ openssl sha1 result-3854 f*
    SHA1(result-3854)= 99db13b557cb00b7b15410bad1c360e89b530f58
    SHA1(f0)= cb19f836c2830ff88ff45694565da65be73b7a69
    SHA1(f1)= eee4fdda7e8ac4955b9d4b97fb823c07ba0f73b4
    SHA1(f2)= c4f79272572f3fd74800c2d7b83c936646475c2e
    SHA1(f3)= bc143c1ff8156c7ab8d41f4a700c1f2d16fbadb3
    SHA1(f4)= f035d33802e80e293f9cdcc307474809a6c45ad1
    SHA1(f5)= d406dcd45e1a85050ce0eaa34f708de4cf25b142
    SHA1(f6)= 13f9c9358cd0824395f2890aacdced2e93b33f27
    SHA1(f7)= 77bbea0dad5295f4e67c371825f0b1a857cc4b5b
    

    Looking at the hexdump, the downloaded file is a mix of f2, f3, f5, f6, and f7:

    $ hd result-3854
    00000000  02 02 02 02 02 02 02 02  02 02 02 02 02 02 02 02  |................|
    *
    0e800000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
    *
    16800000  05 05 05 05 05 05 05 05  05 05 05 05 05 05 05 05  |................|
    *
    1a800000  06 06 06 06 06 06 06 06  06 06 06 06 06 06 06 06  |................|
    *
    20800000  07 07 07 07 07 07 07 07  07 07 07 07 07 07 07 07  |................|
    *
    3e800000
    

I did these steps on Debian stretch.

Expected behavior

It'd be great if aws-cli could either download the file consistently, or at least detect that it has downloaded a corrupted file and exit nonzero to avoid propagating errors.

feature-request s3

Most helpful comment

@jamesls Please reclassify this as a bug report. It's not a feature request.

All 10 comments

Thank you for the deep dive here. Your explanation makes sense. The difficult part is that for ranged downloads, S3 currently does not return a hash of the contents of that individual ranged GET, so there is not really a way for us to check that whatever we stream down is correct. It is sometimes possible to use the ETag as that MD5 check, but the ETag is only the MD5 under certain conditions (i.e. not server-side encrypted, not a ranged GET, not uploaded via the multipart API). Having a reliable/correct MD5 from S3 to compare the streamed contents against would be ideal.

Another idea that could be pursued, which would be more feasible and specifically handles your use case, would be to always check the ETag for each download to make sure it has not changed during the transfer, and error out if it did. If the ETag changed during the download, then the contents of the remote object must have changed mid-transfer.
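To illustrate that check from the outside, here is a minimal wrapper sketch (bucket, key, and output names are placeholders): it records the ETag before the download, runs the copy, and fails if the ETag has changed by the time the copy finishes. It is only a heuristic; an object that is overwritten and then restored between the two HeadObject calls would still slip through.

    #!/usr/bin/env bash
    # Hypothetical wrapper: fail if the object's ETag changed while we downloaded it.
    set -euo pipefail
    bucket=my-bucket   # placeholder
    key=test-object    # placeholder
    dest=result        # placeholder

    etag_before=$(aws s3api head-object --bucket "$bucket" --key "$key" --query ETag --output text)
    aws s3 cp "s3://${bucket}/${key}" "$dest"
    etag_after=$(aws s3api head-object --bucket "$bucket" --key "$key" --query ETag --output text)

    if [ "$etag_before" != "$etag_after" ]; then
        echo "ETag changed during download; '$dest' may be corrupt" >&2
        exit 1
    fi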

As to options available to you in the meantime, you could:

  • Update multipart_threshold in your config file to something really high so you never do ranged downloads. The downside is that transfers may be slower, since each individual object will be downloaded serially instead of in parallel with ranges (a config sketch follows this list).
  • Avoid uploading to that key while a download is in progress.
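For reference, the first option maps onto the CLI's S3 transfer configuration; here is a sketch of ~/.aws/config with an arbitrarily chosen 5 GB threshold (pick whatever value exceeds your largest objects):

    # ~/.aws/config -- raise the threshold so objects below it are fetched
    # with a single GET instead of parallel ranged GETs (the value is an example)
    [default]
    s3 =
        multipart_threshold = 5GB

The same value can be set from the command line with 'aws configure set default.s3.multipart_threshold 5GB'.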

Let me know what you think.

This doesn't seem like a feature request to me; the current behavior gives wrong results.
I don't believe hashing is part of the solution. Can we query the key's version before the download and pin each ranged request to that version? This would solve the bug without any hashing.
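A minimal sketch of that version-pinning idea using the s3api commands, assuming the bucket has versioning enabled (otherwise there is no stable version id to pin to); the bucket, key, and 8 MiB part size are placeholders:

    #!/usr/bin/env bash
    set -euo pipefail
    bucket=my-bucket   # placeholder
    key=test-object    # placeholder
    part_size=$((8 * 1024 * 1024))

    # Pin to whatever version is current right now.
    version=$(aws s3api head-object --bucket "$bucket" --key "$key" \
        --query VersionId --output text)
    size=$(aws s3api head-object --bucket "$bucket" --key "$key" \
        --version-id "$version" --query ContentLength --output text)

    # Fetch every ranged part from that same pinned version, so a concurrent
    # overwrite of the key cannot mix parts from different objects.
    offset=0
    part=0
    : > result
    while [ "$offset" -lt "$size" ]; do
        end=$((offset + part_size - 1))
        aws s3api get-object --bucket "$bucket" --key "$key" \
            --version-id "$version" --range "bytes=${offset}-${end}" \
            "part-${part}" > /dev/null
        cat "part-${part}" >> result && rm "part-${part}"
        offset=$((end + 1))
        part=$((part + 1))
    done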

Yes, hashing won't help for this scenario, but the ETag suggestion should work.

I do not think we would be able to query the version. The CLI uses ListObjects and HeadObject for querying the objects in an S3 bucket, but only HeadObject returns the version id; ListObjects does not. So if we started querying for the version id as well, it would slow down the transfer process, because a separate API call would be needed for each object.

So I was on the fence about whether to mark this as a bug or a feature request; there's not really an in-between label. I marked it as a feature request originally because the CLI has never really supported being aware that your source data is being manipulated while the transfer is occurring. For example:

  • If you update a file during a sync, depending on when the CLI inspected the file/object it may or may not resync the file.
  • If you modify a local file as it is being uploaded, the contents may differ depending on when the file was modified relative to when the PutObject request was made.

However, despite the label, I think this is something important that we should address; if there are any safety mechanisms we can put in place to help ensure data integrity, they should be put in place.

An extra round trip would be preferable, to me. For large files, the overhead is negligible. Perhaps doing it only for large files (exactly those files for which we're doing multi-part downloads) would be acceptable.

That aside, just from reading the docs, HeadObject returns exactly the same metadata as GetObject, which is presumably what the multipart download uses. Could we record the version from each of those responses and compare them afterward, in order to flag corrupted downloads? Or better, arrange things so that the version from the first request to complete wins; any request that completes with a different version is re-sent, explicitly asking for the correct version.
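For buckets without versioning, a similar per-request guard could be approximated with an ETag precondition. A sketch with placeholder names; a real downloader would loop over all ranges and restart from scratch if S3 answers 412 Precondition Failed:

    # Take the ETag up front, then ask S3 to reject any ranged GET whose ETag
    # no longer matches (S3 returns 412 Precondition Failed in that case).
    bucket=my-bucket   # placeholder
    key=test-object    # placeholder
    etag=$(aws s3api head-object --bucket "$bucket" --key "$key" --query ETag --output text)

    # Example: first 8 MiB of the object, pinned to that ETag.
    aws s3api get-object --bucket "$bucket" --key "$key" \
        --if-match "$etag" --range "bytes=0-8388607" part-0 > /dev/null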

Good Morning!

We're closing this issue here on GitHub, as part of our migration to UserVoice for feature requests involving the AWS CLI.

This will let us get the most important features to you, by making it easier to search for and show support for the features you care the most about, without diluting the conversation with bug reports.

As a quick UserVoice primer (if not already familiar): after an idea is posted, people can vote on the ideas, and the product team will be responding directly to the most popular suggestions.

We’ve imported existing feature requests from GitHub - Search for this issue there!

And don't worry, this issue will still exist on GitHub for posterity's sake. As it’s a text-only import of the original post into UserVoice, we’ll still be keeping in mind the comments and discussion that already exist here on the GitHub issue.

GitHub will remain the channel for reporting bugs.

Once again, this issue can now be found by searching for the title on: https://aws.uservoice.com/forums/598381-aws-command-line-interface

-The AWS SDKs & Tools Team

This is a bug report, not a feature request.

Based on community feedback, we have decided to return feature requests to GitHub issues.

@jamesls Please reclassify this as a bug report. It's not a feature request.

I ran into this for a largish file.

Is 'aws s3 mv' atomic and fast (a metadata update) within S3, as it is on a Linux filesystem? If so, then a 'cp' that overwrites the object could be replaced with a 'cp' from local to object.tmp on S3, followed by the equivalent of 'mv object.tmp object' within S3. But if 'mv' actually copies the contents over, then it's useless.

Dang! It looks like 'mv' just creates a new object by copying the contents, even within the same folder of the same bucket. So it's useless!

But 'mv' is significantly faster than 'cp' (within S3, I mean).
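That speed difference is consistent with 'mv' doing a server-side copy plus delete rather than re-uploading the bytes from the client. Below is a sketch of the temp-key-then-rename idea with placeholder names; note that the final copy still creates a brand-new object at the destination key, so a concurrent ranged download of that key can still mix old and new data, and a single server-side copy-object call only works for objects up to 5 GB:

    # Upload to a temporary key, then "rename" it with a server-side copy
    # and delete the temporary key. No bytes are re-uploaded from the client.
    bucket=my-bucket   # placeholder
    aws s3 cp ./local-file "s3://${bucket}/object.tmp"
    aws s3api copy-object --bucket "$bucket" --key object \
        --copy-source "${bucket}/object.tmp" > /dev/null
    aws s3 rm "s3://${bucket}/object.tmp"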
