Boto3: Streaming Uploads?

Created on 9 Sep 2015 · 13 comments · Source: boto/boto3

Hey,

Sorry for treating this as a mailing list, I didn't see any other method for contact, so I went ahead and opened an issue.

I'm trying to use boto3 to upload files that were uploaded to PyPI to S3. The majority of these files will be < 60 MB, but a handful of them will be larger (up to a few hundred MB in size). I'm trying to figure out what the right interface for doing this is. Right now, in PyPI we have a streaming upload from the client along with the expected MD5 hash of the entire file once it's been uploaded. I'm wondering if I can do something like:


import hashlib

class HashingFileWrapper:

    def __init__(self, wrapped, md5_hash):
        self.wrapped = wrapped
        self.md5_hash = md5_hash
        self.hash_ctx = hashlib.md5()

    def read(self, *args, **kwargs):
        chunk = self.wrapped.read(*args, **kwargs)
        self.hash_ctx.update(chunk)
        if not chunk:
            # End of stream: verify the digest of everything read so far.
            if self.hash_ctx.hexdigest() != self.md5_hash:
                raise ValueError("Hash Does Not Match")
        return chunk


my_s3_object.put(
    Body=HashingFileWrapper(file_like_object, md5_hash),
    ContentLength=file_size,
    ContentMD5=md5_hash,
)

Will that stream it up to S3 without buffering the whole file in memory? If not, is my only option to buffer the data to a temporary file and then use the my_s3_object.upload_file() interface?


All 13 comments

Does it? It was documented as taking bytes, so I wasn't sure if it was going to read the entire file object into memory and then upload it, or stream it without loading the whole thing into memory. I'd normally just look at the code, but the auto-generated nature of these libs makes that hard to do. Do I need to use the low-level interface to get streaming, or does using the put() method on a boto3 Object work too?

Yes, it does.

Although not (yet?) mentioned in the documentation for Botocore's S3.Client.put_object(), it does accept a file-like object, and there is even a test case to ensure that. You won't find the streaming implementation in this code base, because it is actually handled by the underlying library, requests.

Both Boto 3's Object.put() and Bucket.put_object() call Botocore's put_object(), so they support streaming as well. It is mentioned here.

The higher-level S3Transfer in Boto3 provides handier features. Its upload_file() accepts a filename, automatically splits a big file into multiple chunks (default chunk size 8 MB, default concurrency 10), and streams each chunk through the aforementioned low-level APIs.
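For reference, a minimal sketch of that high-level path (the bucket and file names are placeholders; the 8 MB chunk size and concurrency of 10 just spell out the defaults mentioned above):

import boto3
from boto3.s3.transfer import S3Transfer, TransferConfig

client = boto3.client("s3")

# Spell out the defaults: files above the threshold are split into
# 8 MB parts and uploaded with up to 10 concurrent threads.
config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,
    multipart_chunksize=8 * 1024 * 1024,
    max_concurrency=10,
)

transfer = S3Transfer(client, config)
transfer.upload_file("/tmp/bigfile.bin", "my-bucket", "bigfile.bin")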

It looks like S3Transfer only supports uploading a file that is currently on disk, since you have to pass it a filename. Is that accurate?

Yes, you are right.

Cool, thanks!

Just a heads up: the aforementioned high-level upload_file() and download_file() methods are now injected into boto3's S3 Client, Bucket and Object, so they are easier to use now. Check out the usage documentation here.
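For example (the bucket and file names here are made up):

import boto3

s3 = boto3.client("s3")
s3.upload_file("/tmp/bigfile.bin", "my-bucket", "bigfile.bin")
s3.download_file("my-bucket", "bigfile.bin", "/tmp/copy.bin")

# The same methods are injected on the resource objects too.
bucket = boto3.resource("s3").Bucket("my-bucket")
bucket.upload_file("/tmp/bigfile.bin", "bigfile.bin")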

So afaik put_object() does not work with a non-seekable stream... which makes me sad.

Sad, but true. A seekable input stream is needed because, when a chunk of an upload to S3 fails, Boto needs to rewind the stream and retry.
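One workaround, if the data fits on disk, is to spool the non-seekable stream to a temporary file first so the body becomes seekable; a rough sketch (the function name and chunk size are my own):

import tempfile

import boto3

def put_from_nonseekable(stream, bucket, key, chunk_size=8 * 1024 * 1024):
    # Spool to a temp file (held in memory up to max_size, then on disk)
    # so boto can rewind it if a retry is needed.
    with tempfile.SpooledTemporaryFile(max_size=chunk_size) as spool:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            spool.write(chunk)
        spool.seek(0)
        boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=spool)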

Hi rayluo,
Can we pass an actual data buffer instead of a filename to upload_file() in boto3?

With put_object(), I am suffering from a high memory footprint. Is there some cleanup call I am missing after the put_object() call?

@amarpatil5060 make sure to close the file-like object after the upload has finished, otherwise its memory buffer will remain.
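For example, a context manager guarantees the handle (and any buffer behind it) is released as soon as the upload returns (bucket and key are placeholders):

import boto3

s3 = boto3.client("s3")

with open("/tmp/bigfile.bin", "rb") as body:
    s3.put_object(Bucket="my-bucket", Key="bigfile.bin", Body=body)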

Hi guys,
I'm trying to do something like this:

  1. Download an S3 file into a BytesIO stream
  2. Pipe that stream through a subprocess.Popen shell command and its result back into another BytesIO stream
  3. Use that output stream to feed an upload to S3
  4. Return only after the upload was successful

So it's non-seekable, it happens in parallel and might need to wait for input to come in, and I don't know the size of the data.

I'm trying to do this with the two S3.Client.*load_fileobj() methods, but I'm not sure how to block until the upload is complete. Any ideas?
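To make the setup concrete, here is a rough sketch of such a pipeline, assuming a gzip command and placeholder bucket/key names; get_object() is used on the download side so the subprocess can be fed sequentially:

import subprocess
import threading

import boto3

s3 = boto3.client("s3")

proc = subprocess.Popen(
    ["gzip", "-c"], stdin=subprocess.PIPE, stdout=subprocess.PIPE
)

def feed():
    # Stream the source object into the subprocess without buffering
    # the whole file in memory.
    body = s3.get_object(Bucket="src-bucket", Key="src-key")["Body"]
    for chunk in iter(lambda: body.read(1024 * 1024), b""):
        proc.stdin.write(chunk)
    proc.stdin.close()

threading.Thread(target=feed).start()

# upload_fileobj() reads proc.stdout sequentially until EOF; whether
# and how it blocks is exactly the open question here.
s3.upload_fileobj(proc.stdout, "dst-bucket", "dst-key")
proc.wait()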

The accompanying Stack Overflow question is here: http://stackoverflow.com/questions/42382693/boto3-wait-for-s3-streaming-upload-to-complete

Thanks!
Max

I was just digging into the source code and realized that the *load_fileobj() methods all block until the transfer is complete! See e.g. here.

In my case, I actually want more control over that. There's no point in streaming the whole thing through so many pipes if I block right after the first stream comes in. That would just mean everything gets stored in memory and the next element of the pipeline has to wait.

I will probably copy and modify the functions so the future is returned to the caller, who is then free to resolve it any time they want. That should work in combination with my streams. But it would be great if boto3 supported this via a parameter, e.g. if the user passes block=False, the future is returned instead of the result.
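A minimal sketch of that idea, just wrapping the blocking call in a thread pool so the caller gets a concurrent.futures.Future (nothing boto-specific about it; the helper name is my own):

from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
executor = ThreadPoolExecutor(max_workers=1)

def upload_fileobj_async(fileobj, bucket, key):
    # Returns immediately; the caller decides when (or whether)
    # to block on .result().
    return executor.submit(s3.upload_fileobj, fileobj, bucket, key)

# future = upload_fileobj_async(stream, "my-bucket", "my-key")
# ... keep feeding the pipeline ...
# future.result()  # block only when the upload must be done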
