If I start an s3 upload, call abort(), and then call send() on the uploader object, I can continue/resume a started upload. But, how can I continue (resume) an upload, if I close the tab and then don't have access to that object (i.e. we don't have the state of which parts are remaining to be pushed to s3)?
My general idea is that, if I start the uploader with the same file again, it should continue from the old state (but it actually starts from the beginning). For example, I can see from here https://github.com/TTLabs/EvaporateJS that they store the old state in browser local storage, so that you can resume an upload? Is something like that available in aws-sdk-js?
Since file names are not unique across time, I think this would be tricky to get right. If you replace a file with another of the same name, that change would be difficult to detect in a browser environment. Clues like file size and last modified date are not available in all the browser environments we support.
It would be helpful, though, to offer a helper for resuming a multipart upload in a generic way. (E.g., something that calls ListParts and resumes based on which parts have already been uploaded.)
@jeskew If I want to implement that myself for my needs by editing (I'm guessing) AWS.S3.ManagedUpload, I would probably just need to save the required data to local storage and then fill in properties such as "parts", "completeInfo", and the like on ManagedUpload, right? Or is it a much bigger task and I'm missing something?
(I'll be validating the file size, name, etc. myself, so I don't expect issues there, since we probably don't have to support as many browsers as you guys do.)
It might be as simple as populating completeInfo with the completed part numbers and their associated ETags, but I would need to do some testing to be sure.
The SDK should probably support a generic way of picking up a multipart upload from a known state (e.g., how the PHP SDK supports creating a multipart upload from a state object) and provide a helper method that creates a state object from an upload ID (by calling ListParts and extracting the relevant data). If the SDK added that feature, then you could just store the upload ID in local storage.
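For anyone who wants to experiment before such a helper exists in the SDK, a minimal sketch of that idea with the v2 JavaScript SDK might look like this (`stateFromUploadId` is just an illustrative name, not an SDK API):

```js
// Hypothetical helper (not an SDK API): rebuild a resume state for a known
// upload ID by asking S3 which parts it already has.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function stateFromUploadId(bucket, key, uploadId) {
  const parts = [];
  let params = { Bucket: bucket, Key: key, UploadId: uploadId };
  while (true) {
    // ListParts is paginated, so keep following NextPartNumberMarker.
    const page = await s3.listParts(params).promise();
    (page.Parts || []).forEach(p =>
      parts.push({ PartNumber: p.PartNumber, ETag: p.ETag }));
    if (!page.IsTruncated) break;
    params = Object.assign({}, params, { PartNumberMarker: page.NextPartNumberMarker });
  }
  return { uploadId, parts }; // feed `parts` into your own resume logic
}
```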
@jeskew I'm running into this also, did you do the testing as described above?
Or perhaps something has been implemented in the meantime?
Do we have any feedback on this? @jeskew
I am 58.17% through rolling my own resumable-multipart-upload-in-the-browser solution. I am assuming I need to save not only the UploadId, but also the PartNumber and ETag of every uploaded part.
Has there been progress made to offer some built-in SDK help for resuming a multipart upload? I really like what @jeskew suggested:
...provide a helper method that creates a state object from an upload ID (by calling ListParts and extracting the relevant data)
@jeskew also pointed out the fragility of how to know the file the user selected is the exact same file they previously attempted to upload. I do plan to rely on a hash of file name, lastModified, and size. For my purposes and users, it will be a lesser-evil compromise to get the benefit of resumable file uploading.
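As a rough illustration of that fingerprint idea (the function name and key layout are made up, nothing SDK-specific):

```js
// Sketch: derive a storage key from the selected File so a previously started
// upload can be matched when the user re-selects the "same" file.
function uploadKeyFor(file) {
  return ['mpu', file.name, file.size, file.lastModified].join(':');
}

// After CreateMultipartUpload succeeds:
//   localStorage.setItem(uploadKeyFor(file), uploadId);
// On a later visit, a non-null value means there may be an upload to resume:
//   const previousUploadId = localStorage.getItem(uploadKeyFor(file));
```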
I'm considering whether a custom implementation like @TroyWolf describes would be worth the effort. Did you manage to finish the 41.83% of the implementation? Would you mind sharing any lessons or pitfalls you encountered?
@ekuusela I am ashamed I took so long to reply here. I bookmarked this conversation wanting to really dig in and give you a good and useful response. I DID complete that project successfully for my client. It works incredibly well and has all the features I was hoping for.
I know you want some useful specifics! However, I would have to devote a minimum of 4 hours poring over that code to provide an overview of the critical bits. It is a relatively complicated solution with client and server bits. I really want to do a write-up on this solution as it is something I worked really hard on and am proud of. :coffee:
A very high-level overview of the solution I implemented.
I created these upload API end-points. My API is Python, but you can translate this to your AWS SDK of choice.
/url
Uses boto3.generate_presigned_url() to return the URL the client will use to directly upload a file to my S3 bucket.
/multipart/create
This endpoint takes the file info as input: name, type, size. It also takes an optional UploadId if the client wants to resume a previously started multi-part upload.
If the client sent an UploadId to resume, it uses boto3.list_parts() to discover which parts have already been uploaded.
Otherwise, it uses boto3.create_multipart_upload() to get a new UploadId. This is sent to the browser. To enable my resume features, I store this UploadId in the browser's local storage.
I decide how many parts there will be by dividing the file size into 5 MB chunks. The client will need a presigned URL to upload each chunk. I use boto3.generate_presigned_url() to generate each URL.
So this endpoint returns the UploadId and an array of URLs.
/multipart/complete
Uses boto3.complete_multipart_upload() to finalize the upload in S3. You have to pass in an array of all the parts that includes the returned ETag from the upload response in the browser. Hint: the parts array needs objects that look like { ETag: <ETag from upload response>, PartNumber: <chunk sequence> }
The client (browser) side of this uses the array of URLs to individually upload each chunk to my S3 bucket. For this, I just use XMLHttpRequest().upload(). I got "fancy" and added "throttling" logic. I keep track of the uploads in progress and the timing for each to complete. As long as they are moving "fast", I kick off more parallel uploads--up to 8. If I see the chunks slowing down, I back off to fewer simultaneous uploads.
The resume functionality is handled by storing the UploadId and file info (name, type, size) in browser storage so I can attempt a resume later.
The result is a solution that provides:
Very fast uploading of large files
Direct upload from client to S3 bucket -- so none of your own server resources, storage, and bandwidth are required for the upload traffic.
Ability to resume uploads--picking up from the last chunk successfully uploaded
Like I said, this was VERY high-level leaving a ton of details for you to work out, but maybe this will help some of you!
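If it helps to see the shape of that in JavaScript, here is a rough sketch of the two multipart endpoints translated to the v2 Node aws-sdk. The bucket name, part-size handling, and function names are placeholders of mine, and the HTTP layer is left out -- this is not TroyWolf's actual code, just the same design:

```js
const AWS = require('aws-sdk');
const s3 = new AWS.S3();
const BUCKET = 'my-upload-bucket';     // assumption: your bucket name
const PART_SIZE = 5 * 1024 * 1024;     // 5 MB minimum part size

// "/multipart/create": start (or resume) an MPU and presign one URL per part.
async function createOrResume({ key, size, type, uploadId }) {
  if (!uploadId) {
    const created = await s3.createMultipartUpload({
      Bucket: BUCKET, Key: key, ContentType: type
    }).promise();
    uploadId = created.UploadId;
  }
  const partCount = Math.ceil(size / PART_SIZE);
  const urls = [];
  for (let partNumber = 1; partNumber <= partCount; partNumber++) {
    urls.push(s3.getSignedUrl('uploadPart', {
      Bucket: BUCKET, Key: key, UploadId: uploadId,
      PartNumber: partNumber, Expires: 3600
    }));
  }
  // When resuming, the client can also use listParts() to skip parts
  // that are already on S3.
  return { uploadId, urls };
}

// "/multipart/complete": stitch the parts together. `parts` is the
// [{ ETag, PartNumber }, ...] array collected from the upload responses.
async function completeUpload({ key, uploadId, parts }) {
  return s3.completeMultipartUpload({
    Bucket: BUCKET, Key: key, UploadId: uploadId,
    MultipartUpload: { Parts: parts }
  }).promise();
}
```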
@prestonlimlianjie created a beautiful code example!
https://github.com/prestonlimlianjie/aws-s3-multipart-presigned-upload
It does not include bits to support resume. However, his code is a great starting point to get MPU working. Then it's just a case of "remembering" the file info, the MPU ID, and info about previously uploaded parts (PartNumber and ETag). You can store these or you can use the aws-sdk listParts() method to have AWS tell you the info about uploaded parts.
+1.
Very important and much-needed functionality!
Example use case:
BUT background tasks are a very tricky process nowadays. In general, mobile OSes allow a process to run every 15 minutes (at minimum) and it can work for about 30 seconds, so the process's logic has to create the MPU and send, say, N parts during that window, and so on.
It would be very helpful to have the possibility to _SAVE_ the current state of the MPU and then use it in the next 15-minute interval.
Currently such an algorithm needs a fully manual implementation, but a lot of developers would like this functionality out of the box.
Has anyone been able to achieve what @TroyWolf outlines in NodeJS?
Has anyone been able to achieve what @TroyWolf outlines in NodeJS?
Preston's repo has the frontend as a React mini-app which is javascript, and most front-end javascript just works in Node--sans any DOM bits. So his code provides a lot of hints to build a 100% Node version.
If building the entire solution in Node, though, I can't think why you'd need presigned URLs because you can just use your AWS API access directly in your own Node app.
I'm looking to create a pause and resume upload functionality for s3 uploads in node. Basically two buttons that would help me achieve the same. Your description of your approach is the closest thing I've seen anyone come to it, hence I'd asked.
Any pointers @TroyWolf ?
MPU (multipart upload) will be a foundation to provide a pause/resume experience. The smallest chunk size is 5MB as I recall--fact check that. You'd want to use the smallest chunk size to provide a more granular pause/resume process because when you pause, you would be throwing away progress on the current chunk in progress. When you resume, you'd pick back up after the last fully-uploaded chunk.
This official documentation should be a good starting point.
https://docs.aws.amazon.com/AmazonS3/latest/dev/uploadobjusingmpu.html
What if my file size was less than 5 mb? Would MPU be a good consideration then as well?
So all chunks have to be at least 5 MB _except_ the last chunk. So if you have a file that is only 2 MB, it would be only 1 chunk. That 1 chunk would also be the last chunk, so it's OK to be smaller than 5 MB. You would not be able to pause and resume a 1 chunk file. Your pause would effectively cancel the entire upload. Resume would start over on the single chunk.
The solution will work with small files, you just won't get any pause/resume advantage.
Is there an approach that can be used for resume/pause of upload of smaller files (say .docx) to s3?
streams, perhaps?
Asking since I’m new to Node JS, and the documentation can get a little daunting sometimes.
Disclaimer: I'm not the authority on this subject, but I did stay in a Holiday Inn Express last night. Would love to be proven wrong, but no, you won't have a way to provide pause/resume for uploading small files to s3. I mean, you might be able to add a network hiccup into a stream that technically pauses it for a tiny bit, but I assume you are talking about ability to intentionally pause an upload then resume it at a significantly later time....like 15 minutes or a day, etc.
If you think about the nature of an upload, you have 2 ends--the sender and the receiver. Normally, you control the sending side, and yes, you could pause a stream and remember where you left off. The problem is, you'd need the receiving side to also support this partial file concept. This is what MPU is essentially. It's a way that s3 can receive a file in many parts and hold onto those parts until you either send an "All done" or "Cancel".
Picking this up again,
how exactly would the pause/resume functionality be executed in a node JS environment? Would it be using timeouts/intervals?
For loops can't exactly be paused right?
We use the terms "pause" and "resume", but it's not really a pause. You can't actually pause an upload stream, wait some indeterminate amount of time, then resume the stream.
MPU stands for _multi-part_ upload. These individual parts are what provide _breakpoints_ in your upload of a large file. You can't pause any one part, but you can "pause" your upload between any of the parts. AWS (and others) will remember those uploaded parts while it waits for you to resume the MPU--to upload more parts.
Since the minimum size for a part is 5 MB, you can achieve 5 MB breakpoints in your upload of a file. The exception to that 5 MB minimum is the last part can be smaller. For files smaller than 5 MB, the "last part" is the entire file. Since your only breakpoints are between parts, when you only have 1 part, you have no ability to "pause" the upload--each individual part is either uploaded 100% or not at all.
Which is what I'm referring to: how would one achieve the virtual 'pause' between the 5 MB parts?
The short and not helpful answer is "any way you want"! To answer the question, I have to ask _why would you want to pause an upload_? If your goal is to upload the file, the assumption is you don't want to pause--you want all parts to upload as fast as possible. I have to assume the reason most people ever "pause" an upload is because of something out of their control. In those scenarios, an error is probably thrown.
Presumably, if you have a 98 MB file that you have broken into nineteen 5 MB parts plus a final 3 MB part, your strategy involves a loop--you iterate over a queue of parts, uploading them sequentially or perhaps a few in parallel at a time. (Remember that you can upload the parts in any order and in parallel to maximize bandwidth.) When the error occurs, it will naturally break you out of your loop--effectively _pausing_ your upload. (Whatever parts were in flight are as if they never started.)
So how do you "pause"? You just stop uploading! Break out of your loop--whatever.
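To make that concrete, here is a bare-bones Node sketch of a part loop that "pauses" by simply not starting the next part. The 5 MB part size, the in-memory buffers, and all the names are my own assumptions (aws-sdk v2), not a recommended implementation:

```js
const fs = require('fs');
const AWS = require('aws-sdk');
const s3 = new AWS.S3();
const PART_SIZE = 5 * 1024 * 1024;

let paused = false;        // flip this from a button handler, a signal, etc.
const uploadedParts = [];  // [{ PartNumber, ETag }] -- persist this somewhere

async function uploadParts(filePath, bucket, key, uploadId, startAt = 1) {
  const { size } = fs.statSync(filePath);
  const partCount = Math.ceil(size / PART_SIZE);
  const fd = fs.openSync(filePath, 'r');
  try {
    for (let partNumber = startAt; partNumber <= partCount; partNumber++) {
      if (paused) break;   // "pausing" is simply not starting the next part
      const start = (partNumber - 1) * PART_SIZE;
      const length = Math.min(PART_SIZE, size - start);
      const body = Buffer.alloc(length);
      fs.readSync(fd, body, 0, length, start);
      const { ETag } = await s3.uploadPart({
        Bucket: bucket, Key: key, UploadId: uploadId,
        PartNumber: partNumber, Body: body
      }).promise();
      uploadedParts.push({ PartNumber: partNumber, ETag });
    }
  } finally {
    fs.closeSync(fd);
  }
  return uploadedParts;    // resume later with startAt = last part + 1
}
```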
If you really want to introduce some arbitrary _sleep_, this is achievable in Node. Here are a couple examples of how to achieve this:
https://gist.github.com/daliborgogic/7ee40bcff586ae08b33bf929172d61e8
https://flaviocopes.com/javascript-sleep/
☕
I'm basically trying to achieve a scenario where my webpage has a progress bar for the upload and the user is given options to pause and/or resume the upload.
You have been talking about uploading the file in Node.js. Now, you mention a UI--a user uploading a file in the browser. Something does not add up here. If you are uploading the file in Node--this is server-side. (Node does not run in the browser.) Presumably, then, the entire file is already on your server so it can break the file into parts and perform an MPU to AWS (or elsewhere). If that is true, then what is the point of a progress bar of parts in the browser?
I'm not saying it's easy or intuitive, but if the file starts with a user in a browser, I encourage you to develop an MPU solution that has your user uploading the parts straight into AWS using pre-signed URLs--remove your server from the equation for the actual upload of parts. You'll save a ton of CPU and bandwidth and your uploads will be faster for your users. I have developed exactly this for a couple of clients. Files are securely uploaded straight from the browser to an S3 bucket using MPU. Progress meters and all.
For these clients, I did not create a "pause" button because nobody has wanted that, but if the user simply closes their browser, they effectively "pause" the upload. You can't automatically start resuming the upload because you can't open files for a user in the browser--the user has to select the file for upload. When they come back to the app, they can choose the same file for upload, and it picks back up at the parts that have not been uploaded previously.
It would be relatively easy to "remember" the files that started uploading but did not finish. With this, you could prompt the user next time they enter the app that their file upload has not completed and to please re-select the file for upload....so the upload can resume.
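For reference, uploading a single part from the browser against a presigned uploadPart URL can be as small as this sketch. I'm using fetch here rather than XMLHttpRequest, and your bucket's CORS configuration must expose the ETag header for this to work:

```js
// Sketch only: `url` is a presigned uploadPart URL and `blobSlice` is one
// 5 MB slice of the File (file.slice(start, end)).
async function uploadPartToPresignedUrl(url, blobSlice, partNumber) {
  const response = await fetch(url, { method: 'PUT', body: blobSlice });
  if (!response.ok) throw new Error(`Part ${partNumber} failed: ${response.status}`);
  // S3 returns the part's ETag in a response header; keep it for CompleteMultipartUpload.
  return { PartNumber: partNumber, ETag: response.headers.get('ETag') };
}
```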
Only other strategy I can imagine seems a bit ludicrous, but I guess one _could_ create their own MPU service (yikes) to upload parts from the browser to your own server and have your server upload those same parts immediately via MPU. Your progress bar would have to consider the progress of the parts both from browser to your server and from your server to AWS? Don't go down that path. 😀
Another question, how would you go about retrying the MPU upload in node JS?
Is it just dependent on the Upload ID and resuming from the last part, or does it need something else as well?
There is nothing distinct about a retry vs the initial upload of a part. You can upload the same part 10 times if you want to. Honey badger don't care. AWS only cares that you upload all the parts eventually then send the complete request. If, in fact, all the parts are there and you've done everything right, AWS will assemble those parts into a single file and it will appear in your S3 bucket.
The minimum parameters required to upload a part include the Bucket, Key, PartNumber, UploadId, and the part data itself (Body).
As each part is uploaded, AWS responds with an ETag. You have to hold onto these ETags--you'll need them to complete the MPU. See the CompleteMultipartUpload documentation for details.
The AWS S3 JavaScript documentation for Node will help you, of course.
Here are 2 developers offering their code as examples for you--I'm sure there are many more to be found:
https://gist.github.com/sevastos/5804803
https://gist.github.com/magegu/ea94cca4a40a764af487
Finally, I want to remind you and others about the mpu "gotcha". The parts you upload are hidden in your bucket--you can't see them in the GUI or by using the API to list a bucket's contents. Therefore, you always need to either Complete or Abort an mpu to prevent orphaned parts from taking up space and incurring storage costs.
https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html
As a fail-safe, you probably want to create a rule on your s3 bucket that will eventually cleanup orphaned mpu parts.
https://aws.amazon.com/blogs/aws/s3-lifecycle-management-update-support-for-multipart-uploads-and-delete-markers/
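You can set that rule in the console, or apply it with the SDK along these lines (the rule ID, bucket name, and 7-day window are arbitrary choices of mine):

```js
// Sketch: a bucket lifecycle rule that aborts incomplete multipart uploads
// after 7 days, so orphaned parts don't accumulate storage costs.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

s3.putBucketLifecycleConfiguration({
  Bucket: 'my-upload-bucket', // assumption: your bucket name
  LifecycleConfiguration: {
    Rules: [{
      ID: 'abort-stale-mpu',
      Status: 'Enabled',
      Filter: { Prefix: '' }, // apply to the whole bucket
      AbortIncompleteMultipartUpload: { DaysAfterInitiation: 7 }
    }]
  }
}).promise().then(() => console.log('lifecycle rule applied'));
```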
☕
But how would it know to start from that very part?
You mention breaking out of the loop for "pausing" the upload.
When I try to upload again, I have to start another MPU, right? Do I start that one with the same upload ID as the previous MPU?
(sorry for the repeated line of questioning, I'm new to NodeJS in general and this concept is new to me as well)
Of course I can't teach you Node and how to generally use the AWS javascript SDK via a Github issue! But....to answer your question...
When you CreateMultipartUpload, you'll get an uploadId. You have to save that--keep track of it for the life of your mpu--whether that is one session, or several sessions across multiple days. LocalStorage in the browser? A browser cookie? Stored in your database? All options you get to choose for your solution.
As you upload each part via UploadPart, you'll get an ETag. You have to save that because you'll need it to complete the mpu after all parts uploaded.
Alternatively, if you don't have the ETags for your uploaded parts but you do know your uploadId, you can use the ListParts command to get a list of the parts already uploaded. However, AWS says you should not rely on this command to know your uploaded parts. Instead they encourage you to keep track of them yourself. In my own experience, ListParts works just fine for this.
You asked, "But how would it know start from that very part?". Your question should be, "How do I know which parts still need to be uploaded?" Remember, the order in which you upload the parts does not matter. You'll split your file into parts and assign each one a partNumber--that order matters, but the order in which you upload them does not. You can upload part 17 then part 3 followed by part 64 and so on. You just need to eventually upload all the parts and complete the MPU or abort it to clean up orphaned parts.
You know which parts still need to be uploaded because you know which parts were successfully uploaded: as you uploaded each part, you received an ETag and kept track of it. Alternatively, you called ListParts for the uploadId and AWS told you which parts it already has. That response is an array that includes an object for each part with properties that tell you the partNumber and ETag.
So the process to resume an MPU is only different in that you don't need to call CreateMultipartUpload again--you already know your uploadId.
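Putting that together, a resume might look roughly like this sketch. It assumes fewer than 1,000 parts so a single listParts() call returns everything, and `uploadPartFn` stands in for however you upload one part (it should return { PartNumber, ETag }):

```js
async function resume(bucket, key, uploadId, totalParts, uploadPartFn) {
  // Ask S3 which parts it already has -- no new CreateMultipartUpload needed.
  const listed = await s3.listParts({ Bucket: bucket, Key: key, UploadId: uploadId }).promise();
  const parts = (listed.Parts || []).map(p => ({ PartNumber: p.PartNumber, ETag: p.ETag }));
  const done = new Set(parts.map(p => p.PartNumber));

  // Upload only what's missing, in any order.
  for (let n = 1; n <= totalParts; n++) {
    if (done.has(n)) continue;           // already on S3, skip it
    parts.push(await uploadPartFn(n));   // your own per-part upload
  }

  parts.sort((a, b) => a.PartNumber - b.PartNumber);
  await s3.completeMultipartUpload({
    Bucket: bucket, Key: key, UploadId: uploadId,
    MultipartUpload: { Parts: parts }
  }).promise();
}
```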
For a better understanding -- once I break out of the MPU loop, I have the part ETags and the Upload ID, right?
To continue the upload (from the 'pause'), does one just start another loop with just the uploadPart function from the part to be uploaded next, providing the Upload ID? Is there another parameter to be sent this time around?
Basically when I try to resume the upload for the remaining parts using the same Upload ID, it logs only those parts as uploaded and completes the upload. But on the server, only a 15 MB file exists (if 3 parts were remaining), so it downloads as a corrupted file
You won't see any file on the "server"--the s3 bucket--until all parts have been uploaded and you send the mpu complete command so AWS can reassemble the parts in partNumber order.
If the resulting file is corrupted it is probably because you got the part numbers wrong during upload or messed up when you split the file into chunks--leaving out a bit between each chunk or perhaps some overlap.
Honestly, the questions you are asking suggest you are far away from your goal. I encourage you to take a working code example such as this one and play with it--upload some files. Add some console logging to understand the parts of this code. Then adapt it for your specific needs. This code example from @sevastos is less than 100 lines!