What's the correct way to resume a failed download? I'm trying to download a 200GB VHD, and it always fails at some point. I try to resume using 'azcopy jobs resume [jobId] --source-sas="[sasToken]"', and I get the following (very unhelpful) message in the job's log:
[P#0-T#0] has worker 177 which is processing TRANSFER
ERR: [P#0-T#0] DOWNLOADFAILED: url: 000 : Blob already exists
Dst: c:///localpath/vhdfilename.vhd
JobID=[jobId], Part#=0, TransfersDone=1 of 1
all parts of Job [jobId] successfully completed, cancelled or paused
The message is unhelpful because:
Hello @ckarras
Thank you for your feedback! We have assigned this issue to content team to review further and take the right course of action.
@artemuwka + @seguler - Could you review this request for additional information after receiving these errors?
Does the issue still persist?
Hi @ckarras! Thanks for the feedback. Which version of AzCopyV10 are you using? We've made a few enhancements to upload/download experience which were recently released and I wanted to make sure you're using the latest version - https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10#latest-preview-version-v10.
Due to the volume of issues in our queue, our policy is to close issues that have been inactive for the last 90 days. If there is anything we can help with please don't hesitate to reach out. Thank you!
@artemuwka Please reopen. The issue may have been open for 90 days, but it only got a reply after 84 days. I have not yet had a chance to retry downloading a 200GB file with the latest version to see if I hit the same issue. Also, the documentation still doesn't explain the correct way to resume failed downloads.
@artemuwka I downloaded the latest preview version (10.0.8) and tried this command:
azcopy.exe cp "http://storageaccountname.blob.core.windows.net/containername/blobname?sastoken" c:\localpath.vhd
When I came back to my computer this morning, I see the command has "completed", but there is no indication of whether it succeeded or failed. Instead, it shows several pages of debug logs and stack traces, for example:
goroutine 177016 [select]:
net/http.(*persistConn).writeLoop(0xc047fc0c60)
/usr/local/go/src/net/http/transport.go:1822 +0x152
created by net/http.(*Transport).dialConn
/usr/local/go/src/net/http/transport.go:1238 +0x986
goroutine 176897 [select]:
net/http.(*persistConn).writeLoop(0xc10219fd40)
/usr/local/go/src/net/http/transport.go:1822 +0x152
created by net/http.(*Transport).dialConn
/usr/local/go/src/net/http/transport.go:1238 +0x986
goroutine 175906 [select]:
net/http.(*persistConn).writeLoop(0xc103070d80)
/usr/local/go/src/net/http/transport.go:1822 +0x152
created by net/http.(*Transport).dialConn
/usr/local/go/src/net/http/transport.go:1238 +0x986
goroutine 177539 [select]:
github.com/Azure/azure-storage-azcopy/common.(*chunkedFileWriter).setupProgressMonitoring.func1(0xc0429e4090, 0xf, 0x4010000000000000, 0xc04816f1a0, 0x2a7e, 0x800000, 0xc042a20080, 0x31, 0x159e800000, 0xc1859c9424, ...)
/go/src/github.com/Azure/azure-storage-azcopy/common/chunkedFileWriter.go:353 +0x155
created by github.com/Azure/azure-storage-azcopy/common.(*chunkedFileWriter).setupProgressMonitoring
/go/src/github.com/Azure/azure-storage-azcopy/common/chunkedFileWriter.go:348 +0x192
It also produced an 85MB log file, but the end of the log file doesn't show the status of the download either (100% completed? aborted? if aborted, is it resumable?).
I then tried resuming the download by executing exactly the same command, but I see it created a new job (different job ID, different log file).
Questions:
Thanks
Hi @ckarras, thanks for the feedback.
Normally a job summary is printed after all the transfers conclude. You could try uploading a small file to see the normal behavior.
Did you put the computer to sleep when you left? Are those stack traces printed to the command line? Or did you copy them out of the log files?
The resume command can be invoked with ./azcopy resume [job-id]. But retries are done at the file level, meaning that a half-way done transfer would get restarted from the beginning, so it's not very useful for your scenario.
I assume you are downloading a page blob, right? Since the suffix is .vhd. Unfortunately, the scalability of page blobs is not as good as block blobs. Please try lowering the concurrency (environment variable AZCOPY_CONCURRENCY_VALUE) if you see 503 errors in the logs.
Hope this helps.
To verify if the concurrency override is set properly, you could run ./azcopy env.
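For example, something along these lines (a rough sketch with hypothetical account/container/blob names and SAS token, shown for a bash-style shell; on Windows use set or $env: instead of export):

```
# Hypothetical values; lower the concurrency before retrying the copy.
export AZCOPY_CONCURRENCY_VALUE=8

# Confirm the override is picked up.
./azcopy env

# Retry the download with the reduced concurrency.
./azcopy cp "https://<account>.blob.core.windows.net/<container>/<blob>.vhd?<sas>" "/local/path/<blob>.vhd"
```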
@zezha-msft I did various things to make it fail on purpose (disconnect the network cable, put the computer to sleep/hibernate, etc.), since my objective was to test what happens when a large file download fails and I attempt to resume it. I already tried lowering the concurrency level, but it just caused the download to fail further into the download, forcing me to re-download 120GB instead of only 50GB. For my scenario, I would need something that works at the file level; I guess I'll have to implement something myself using the HTTP API for Blob Storage and range headers for now.
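For anyone in the same situation, a minimal sketch of that kind of range-based resume using curl against the Blob endpoint (hypothetical URL and SAS token; this obviously doesn't give azcopy's parallelism or integrity checking):

```
# Hypothetical blob URL and SAS token.
BLOB_URL="https://<account>.blob.core.windows.net/<container>/<blob>.vhd?<sas>"

# -C - makes curl send a Range header starting at the current size of the local
# file, so rerunning the same command after an interruption picks up where the
# previous attempt stopped instead of starting over.
curl --fail -C - -o "<blob>.vhd" "$BLOB_URL"
```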
For azcopy, this highlights 3 problems:
- Resuming a failed download of a large file is not supported at the file level. (Is there already a feature request for this?)
- The documentation should clarify what is supported or not supported, especially for cases a user could reasonably expect to be available.
- In the case of a failure, the status of a job is not clear in the logs or the console output.
Thanks
I saw an article on the Azure Storage team blog which mentioned that if you are uploading a 100GB page blob and the transfer is interrupted after 30GB, you don't have to transfer from scratch; it will automatically resume from 31GB.
I believe the articles have been disabled, hence I cannot see it anymore.
Hi @ckarras, thanks a lot for your feedback!
We originally did have file-level resumes, but the dominant concern was that files/blobs would be left in a corrupt state if the job fails or gets cancelled. I'll bring up your feedback with the Team to see if we should reconsider this.
We will improve the documentation on this as well. @seguler for visibility.
Thanks @zezha-msft for the information. Note, however, that with the current implementation files/blobs are still left in a corrupt state if the job fails or gets cancelled, but there's nothing the user can do about it except restart the whole download. For large page blobs (around 100GB or more), my experience is that the transfer will always fail at some point for various reasons.
Also, the whole size of the blob has already been allocated on disk, so it's not obvious that the download was incomplete; the user must calculate an MD5 hash and compare it with the MD5 hash returned by the Blob Storage API to know whether the file is corrupt or not.
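For reference, that manual check looks something like this today (hypothetical names; the Content-MD5 header is only present if a stored MD5 was set on the blob, and it is base64-encoded, so the local digest has to be base64-encoded as well):

```
# Hypothetical blob URL, SAS token and local path.
BLOB_URL="https://<account>.blob.core.windows.net/<container>/<blob>.vhd?<sas>"

# The stored MD5 (if any) comes back base64-encoded in the Content-MD5 header.
curl -sI "$BLOB_URL" | grep -i '^Content-MD5'

# Compute the local file's MD5 in the same base64 form and compare.
openssl dgst -md5 -binary "<blob>.vhd" | base64
```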
Suggestions:
Hi @zezha-msft ,
Do you perhaps have an update on whether the request mentioned above has been discussed?
Hi @ckarras, the failed transfers are actually deleted.
@L-Trick, it's very unlikely that we'll add back file-level retries, due to data integrity concerns: there is no guarantee that the blob/file is still in the state AzCopy left it in previously; the lack of this guarantee makes file-level retries extremely dangerous, and they may end up with a corrupted result.
Then what solution do you suggest to download large files such as VHD, given that transfers always fail at some point for various reasons? There needs to be a standard solution, it doesn't make sense that people have to implement their own custom reliable file download solution using the REST API and range requests.
@ckarras, why is the transfer failing though? Are you using the latest version?
There can be several reasons, for example:
If file-level resumes were supported, then the reason shouldn't matter; all reasons should be handled the same way.
(And I was using the latest version last time I tried, but it's pointless to retry with the latest version again if you say resumes are still not supported)
About your integrity concern: the Azure Storage REST API can return an MD5 for a blob: https://docs.microsoft.com/en-us/rest/api/storageservices/get-blob-properties (Content-MD5 header). At the end of a resumed transfer, the server could be queried to get the Content-MD5 header, and azcopy could also calculate the MD5 locally. If there's a mismatch, the user could be warned. This should be good enough as long as the file is not modified during the transfer, including during retries.
Or an improvement over my suggestion would be to:
Hi @ckarras, thanks for the additional context.
AzCopy was massively improved since this issue was posted (it's now GA instead of preview), so I'd highly recommend trying the latest version, as it deals with service throttling (page blobs) in a much better way, and failures should be very unlikely.
As for the MD5, you are referring to the stored MD5 value, which must be set by the client/AzCopy (it is not computed by the service), making it unusable for the scenario you are describing. In addition, the overall MD5 value of the page blob changes every time we do a put range call, so its initial MD5 is useless once the transfer starts.
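For completeness, since the stored MD5 has to come from the client: a client that wants a value it can verify later could compute the MD5 of the local source file and stamp it on the blob once the upload has finished, e.g. via the Set Blob Properties REST call. A rough sketch only, with hypothetical names (the SAS needs write permission, and note that blob HTTP properties omitted from a Set Blob Properties call are reset):

```
# Hypothetical names and SAS token; the SAS must allow writes.
BLOB_URL="https://<account>.blob.core.windows.net/<container>/<blob>.vhd"
SAS="<sas>"

# MD5 of the local source file, base64-encoded as the service expects.
MD5_B64=$(openssl dgst -md5 -binary "source.vhd" | base64)

# Stamp the stored MD5 on the blob; properties not included here
# (Content-Type, Cache-Control, ...) are cleared by this call.
curl --fail -X PUT -H "x-ms-blob-content-md5: ${MD5_B64}" -H "Content-Length: 0" \
     "${BLOB_URL}?comp=properties&${SAS}"
```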
Hi @zezha-msft,
What do you mean by "failures should be very unlikely"? The examples I provided are all due to external factors that are out of azcopy's control. Are you saying that with the latest version, I should be able to, for example:
?
If not, I don't understand how you can say these types of failures should be very unlikely.
Thanks
About the MD5: since you mention a "put range" call, I see you're talking about the case where azcopy is being used to write a page blob (copied from another blob or from a local file). In that case, the MD5 that should be calculated is the one from the source file, not from the destination file, which will of course change as the file is being written.
So the cases would be:
I agree with all @ckarras is saying here. It's disappointing AZCOPY does not support file resuming, and there are many reasons why a transfer could be interrupted that are external to one's environment.
I feel the solution to the file integrity concern is as @ckarras mentioned: check if the MD5 hash has changed; if not, resume the download, otherwise alert the user and restart. To simply remove resume forever is a disappointing shortcoming of the utility. For example, I have a 330GB VHD that I want to download, and for it to be an all-or-nothing proposition is crazy.
@normesta - Can you determine whether there is any doc update required here? If not, we will pass the product suggestions on to the engineering team.