There seems to be nothing preventing S3Transfer from uploading any object that supports read, seek, tell, and size, so I would like to suggest an abstraction that receives a file object.
e.g. the existing method could defer to a new, more generic one:

def upload_file(self, filename, bucket, key,
                callback=None, extra_args=None):
    # Open in binary mode and hand off to the generic file-object method.
    with open(filename, 'rb') as fileobj:
        self.upload_file_object(fileobj, bucket, key,
                                callback=callback, extra_args=extra_args)
I agree that it would be good to support file-like objects, though it wouldn't be as simple as a single function alias. We'll look into adding it to our backlog, thanks for reporting!
It's not too difficult, I think. I'm working on a pull request for this:
https://github.com/boto/boto3/compare/develop...grischa:s3-upload-file-objects?expand=1
It's not done; I'm just posting it to let you know that I'm working on it.
I cannot see a way for both file-like-object abstractions and OSUtil abstractions to share much code, unfortunately. Is there an important reason for OSUtil, or did someone just think Python doesn't provide enough OS agnosticism?
In particular the fact that OSUtil.open_file_chunk_reader is abstracted makes things difficult.
We should also add support for non-seekable streams, as brought up in this issue: https://github.com/boto/boto3/issues/518
@grischa Why would you seek to 0? Shouldn't I be able to pass a file-like object that I have already seeked to the point I want to start from? That seems like the more expected default behavior, by analogy with the read() method of file-like objects.
Good point, @wt.
However, I haven't been able to do any further work on this. Using boto2 instead was the easier option for my purposes for now.
If I were to start again, I would not even calculate the file size; I would just use multipart uploads by default when no size is given and increase the chunk size gradually as the total size and the number of chunks grow. I got the idea from a suggestion on Stack Overflow, and it makes it possible to write a stream-backed file-like object to S3. If the file object had already been sought to some non-zero offset, it would just start from there, I guess.
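For what it's worth, here is a rough sketch of that chunk-sizing idea (the base size and the doubling schedule are my own illustrative numbers, not values taken from any boto code):

def chunk_size_for_part(part_number, base=8 * 1024 * 1024):
    # Doubling the part size every 1000 parts keeps a stream of unknown length
    # under S3's 10,000-part limit while still covering multi-terabyte objects.
    return base * (2 ** (part_number // 1000))

def iter_parts(stream, base=8 * 1024 * 1024):
    """Yield (part_number, data) pairs from a binary stream without seeking."""
    part_number = 1
    while True:
        # Assumes read(n) returns n bytes until EOF (true for buffered streams).
        data = stream.read(chunk_size_for_part(part_number, base))
        if not data:
            break
        yield part_number, data
        part_number += 1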
+1 for me on this feature; it's annoying to have to write the stream to disk when ideally I'd like to pass it straight through to S3.
btw, this can also be achieved by replacing the OSUtils class, like so:

import boto3
from boto3.s3.transfer import TransferConfig

class ChainedFileOSUtils(object):
    def get_file_size(self, chained_file):
        assert isinstance(chained_file, ChainedFile)
        return chained_file.getsize()

    def open_file_chunk_reader(self, chained_file, start_byte, chunk_size, callback):
        assert isinstance(chained_file, ChainedFile)
        # We need a cloned chained_file for each chunk reader, as each one
        # changes the underlying offsets independently.
        file_size = self.get_file_size(chained_file)
        return boto3.s3.transfer.ReadFileChunk(
            chained_file.clone(), start_byte, chunk_size, file_size, callback, True)

    def open(self, filename, mode):
        assert False  # never called when uploading a file object

# region_name, connectOpts, chunkSize, chained_file, bucket and key come from
# the surrounding application.
client = boto3.client('s3', region_name=region_name, **connectOpts)
config = TransferConfig(multipart_chunksize=chunkSize)
transfer = boto3.s3.transfer.S3Transfer(client=client, config=config,
                                        osutil=ChainedFileOSUtils())
transfer.upload_file(chained_file, bucket, key)
@thehesiod thanks for that example - I was able to use that along with GzipFile and BytesIO to put together a nice pipeline that streams files from a legacy service and compresses and uploads them to S3 without having to write to disk. Even got it all running within Lambda.
@brandond there was a slight issue in my example; I've updated it. In open_file_chunk_reader you need to clone the file object "chained_file", since each copy modifies the underlying offsets independently. For other readers, replace "chained_file" with an io.RawIOBase subclass.
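The thread never shows the ChainedFile class itself, so here is a hypothetical minimal stand-in of my own, built on a shared bytes buffer: it provides the getsize() and clone() methods that ChainedFileOSUtils calls plus the read/seek/tell that ReadFileChunk needs, and each clone gets its own cursor so concurrent chunk readers do not disturb each other (adjust or drop the isinstance asserts accordingly):

import io

class CloneableBytesFile(object):
    """A cloneable, in-memory file-like object over one shared bytes buffer."""

    def __init__(self, data):
        self._data = data                # shared, immutable payload
        self._buf = io.BytesIO(data)     # per-instance read position

    def getsize(self):
        return len(self._data)

    def clone(self):
        # Each clone shares the bytes but owns its own offset.
        return CloneableBytesFile(self._data)

    def read(self, n=-1):
        return self._buf.read(n)

    def seek(self, offset, whence=0):
        return self._buf.seek(offset, whence)

    def tell(self):
        return self._buf.tell()

    def close(self):
        self._buf.close()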
@thehesiod thanks, I'll use that. The files I've been handling so far are all under the 8MB default chunk size, so I haven't run into any issues with multiple concurrent reads to the buffer.
I have a query.
Suppose a client is uploading a huge file to a file server.
Now, on the file server, if I use the file object from the HTTP request and simultaneously upload it to S3 without writing it to disk, will I run out of memory at any point? Because everything will be in memory.
If I write the file to disk in chunks as it's being uploaded and then upload it to S3 by reading it in chunks (or directly use s3cmd, which does that internally), I will not hit the above issue.
Which is the ideal approach for my use case?
@Sidhesh-telsiz That entirely depends on how much memory you have available and how you are buffering the file. My approach was to stream the response body into a BytesIO buffer and then pass that into S3Transfer, using a derivative of ChainedFileOSUtils to process the buffer. This is pretty memory-intensive, though, as you end up with at least two copies of the entire body in memory: the original copy, plus one more for each instance of ReadFileChunk.
A less memory-intensive but more complicated approach might be to write yourself an HTTP IO wrapper that uses byte serving (RFC 7233 Range Requests) to read the individual chunks on demand. You still have to buffer entire chunks, though, as various bits of the code expect to be able to seek around within the handle.
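To make that concrete, here is a bare-bones sketch of such a wrapper (my own illustration, assuming the requests library and a server that honors Range headers; it is not code from this thread):

import io
import requests

class HttpRangeReader(io.RawIOBase):
    """Seekable, read-only view of a remote resource using Range requests."""

    def __init__(self, url):
        self.url = url
        self.pos = 0
        # Assumes the server reports Content-Length on a HEAD request.
        self.size = int(requests.head(url, allow_redirects=True).headers['Content-Length'])

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self.pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self.pos = offset
        elif whence == io.SEEK_CUR:
            self.pos += offset
        elif whence == io.SEEK_END:
            self.pos = self.size + offset
        return self.pos

    def read(self, n=-1):
        if n < 0:
            n = self.size - self.pos
        if n == 0 or self.pos >= self.size:
            return b''
        end = min(self.pos + n, self.size) - 1
        # Fetch only the requested byte range; each call buffers one chunk.
        resp = requests.get(self.url, headers={'Range': 'bytes=%d-%d' % (self.pos, end)})
        data = resp.content
        self.pos += len(data)
        return data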
The ability to upload file-like objects is now supported as of boto3 1.4.0. I would recommend reading https://boto3.readthedocs.io/en/latest/guide/s3.html#uploads and using one of the upload_fileobj() methods to upload file-like objects.
I also noticed there were discussions of implementing custom OSUtils classes, so it would be good to look at the upgrade notes (https://boto3.readthedocs.io/en/latest/guide/upgrading.html) before upgrading, as functionality may have changed in ways that break assumptions made in custom OSUtils implementations.
Otherwise, resolving the issue as the functionality is now available.
@kyleknap Does the implementation take into consideration the memory concern I mentioned above?
Is there any kind of chunking done ?
@Sidhesh-telsiz
It does do chunking. Here is the underlying parameter that controls the maximum number of chunks held in memory, to ensure you do not run out of memory. Essentially, the library reads one chunk (~8 MB) at a time from the input stream, delegates chunks to worker threads that perform the part uploads, and waits to read more whenever the number of chunks currently in memory is at that maximum.
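For reference, a sketch of tuning that behavior (parameter names reflect my understanding of the boto3/s3transfer API; double-check them against your installed version, and the bucket/key names are placeholders):

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# multipart_chunksize is the ~8 MB read size mentioned above; max_concurrency
# bounds the worker threads doing the part uploads. The cap on how many chunks
# may sit in memory at once lives in the underlying s3transfer TransferConfig
# (I believe the attribute is max_in_memory_upload_chunks), so peak buffering
# is roughly multipart_chunksize times that cap.
config = TransferConfig(multipart_chunksize=8 * 1024 * 1024, max_concurrency=4)

# Works with any readable binary stream, not just files opened from disk.
with open('large-file.bin', 'rb') as stream:
    s3.upload_fileobj(stream, 'my-bucket', 'my-key', Config=config)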
Thanks @kyleknap
Will take a look at the pointer and use it.
Hi Kyle,
The documentation for this API shows the following:
with open("tmp.txt", "rb") as f:
    s3.upload_fileobj(f, "bucket-name", "key-name")
However, if the file is not located on disk, then using with throws an exception:
Error: [Errno 2] No such file or directory
In my case I upload a file from the local machine to the server, and while the file is in memory I start using s3.upload_fileobj.
If I directly use the file object that I get from the request.FILES.items() dictionary, it works fine.
The file object in my case is of type InMemoryUploadedFile.
I think with (via open()) expects the file to be available locally and throws an exception otherwise.
Is my understanding correct?
@Sidhesh-telsiz
So, the example in the docs is just showing how to use upload_fileobj() with a file-like object created with open(). It is not a requirement that the file-like object be created with open(); I think your error comes from calling open() on a file that does not exist on disk. upload_fileobj() should work with any file-like object that has a read() method and produces binary data, so you do not necessarily need to use open() or the with context manager.
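For example, something along these lines should work with a purely in-memory object (bucket and key names are placeholders):

import io
import boto3

s3 = boto3.client('s3')

# Any binary file-like object with read() works; nothing needs to exist on disk.
buf = io.BytesIO(b'hello from memory')
s3.upload_fileobj(buf, 'my-bucket', 'in-memory-key')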
@kyleknap
Thanks for the clarification. It's working without using 'with'.
Will continue to use that.
@kyleknap
Does this API also take care of verifying the MD5 of the file once the transfer completes, just to make sure the file was not altered during the transfer and is valid?
What is a chained file here?