Azure-docs: Need documentation on how to resume failed downloads with Azcopy v10

Created on 9 Dec 2018 · 26 comments · Source: MicrosoftDocs/azure-docs

What's the correct way to resume a failed download? I'm trying to download a 200 GB VHD, and it always fails at some point. I try to resume using 'azcopy jobs resume [jobId] --source-sas="[sasToken]"', and I get the following (very unhelpful) message in the job's log:

[P#0-T#0] has worker 177 which is processing TRANSFER
ERR: [P#0-T#0] DOWNLOADFAILED: url: 000 : Blob already exists
Dst: c:///localpath/vhdfilename.vhd
JobID=[jobId], Part#=0, TransfersDone=1 of 1
all parts of Job [jobId] successfully completed, cancelled or paused

The message is unhelpful because:

  • Yes, I know the blob already exists; I'm trying to resume its download, so what would be abnormal is if the blob did not already exist
  • Which of the following events actually succeeded: completed, cancelled or paused? Why did azcopy decide to cancel or pause instead of resuming as I requested?
Pri3 assigned-to-author doc-enhancement storagsvc triaged

All 26 comments

Hello @ckarras

Thank you for your feedback! We have assigned this issue to the content team to review further and take the right course of action.

@artemuwka + @seguler - Could you review this request for additional information after receiving these errors?

Does the issue still persist?

Hi @ckarras! Thanks for the feedback. Which version of AzCopyV10 are you using? We've made a few enhancements to upload/download experience which were recently released and I wanted to make sure you're using the latest version - https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10#latest-preview-version-v10.

Due to the volume of issues in our queue, our policy is to close issues that have been inactive for the last 90 days. If there is anything we can help with please don't hesitate to reach out. Thank you!

@artemuwka Please reopen. The issue may have been open for 90 days, but it only got a reply after 84 days. I have not yet had a chance to retry downloading a 200 GB file with the latest version to see whether I hit the same issue. Also, the documentation still doesn't explain the correct way to resume failed downloads.

@artemuwka I downloaded the latest preview version (10.0.8) and tried this command:

azcopy.exe cp "http://storageaccountname.blob.core.windows.net/containername/blobname?sastoken" c:localpath.vhd

When I came back to my computer this morning, I saw that the command had "completed", but there was no indication of whether it succeeded or failed. Instead, it showed several pages of debug logs and stack traces, for example:

goroutine 177016 [select]:
net/http.(*persistConn).writeLoop(0xc047fc0c60)
/usr/local/go/src/net/http/transport.go:1822 +0x152
created by net/http.(*Transport).dialConn
/usr/local/go/src/net/http/transport.go:1238 +0x986

goroutine 176897 [select]:
net/http.(*persistConn).writeLoop(0xc10219fd40)
/usr/local/go/src/net/http/transport.go:1822 +0x152
created by net/http.(*Transport).dialConn
/usr/local/go/src/net/http/transport.go:1238 +0x986

goroutine 175906 [select]:
net/http.(*persistConn).writeLoop(0xc103070d80)
/usr/local/go/src/net/http/transport.go:1822 +0x152
created by net/http.(*Transport).dialConn
/usr/local/go/src/net/http/transport.go:1238 +0x986

goroutine 177539 [select]:
github.com/Azure/azure-storage-azcopy/common.(*chunkedFileWriter).setupProgressMonitoring.func1(0xc0429e4090, 0xf, 0x4010000000000000, 0xc04816f1a0, 0x2a7e, 0x800000, 0xc042a20080, 0x31, 0x159e800000, 0xc1859c9424, ...)
/go/src/github.com/Azure/azure-storage-azcopy/common/chunkedFileWriter.go:353 +0x155
created by github.com/Azure/azure-storage-azcopy/common.(*chunkedFileWriter).setupProgressMonitoring
/go/src/github.com/Azure/azure-storage-azcopy/common/chunkedFileWriter.go:348 +0x192

It also produced an 85 MB log file, but the end of the log file doesn't show the status of the download either (100% completed? aborted? if aborted, is it resumable?).

I then tried resuming the download by executing exactly the same command, but I see it created a new job (different job ID, different log file).

Questions:

  • How can I know if a copy operation succeeded?
  • If a file was not downloaded completely, how can I resume the download?

Thanks

Hi @ckarras, thanks for the feedback.

Normally a job summary is printed after all the transfers conclude. You could try uploading a small file to see the normal behavior.
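
For example, a minimal sketch of such a test (hypothetical file name, account, container, and SAS token; adjust for your environment):

# upload a small file; a job summary (completed/failed transfer counts) should be printed when the job ends
azcopy cp "smallfile.txt" "https://<account>.blob.core.windows.net/<container>/smallfile.txt?<SAS>"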

Did you put the computer to sleep when you left? Are those stack traces printed to the command line? Or did you copy them out of the log files?

The resume command can be invoked with ./azcopy jobs resume [job-id]. But retries are done at the file level, meaning that a half-way done transfer gets restarted from the beginning, so it's not very useful for your scenario.

I assume you are downloading a page blob, since the suffix is .vhd. Unfortunately, the scalability of page blobs is not as good as that of block blobs. If you see 503 errors in the logs, please try lowering the concurrency via the environment variable AZCOPY_CONCURRENCY_VALUE.

Hope this helps.

To verify that the concurrency override is set properly, you can run ./azcopy env.
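
A minimal sketch of that tuning (the value 16 is only an example; choose what suits your connection):

export AZCOPY_CONCURRENCY_VALUE=16   # on Windows cmd, use: set AZCOPY_CONCURRENCY_VALUE=16
./azcopy env                         # should now show the concurrency override among the environment settings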

@zezha-msft I did various things to make it fail on purpose (disconnecting the network cable, putting the computer to sleep/hibernate, etc.), since my objective was to test what happens when a large file download fails and I attempt to resume it. I had already tried lowering the concurrency level, but that just caused the download to fail further into the transfer, forcing me to re-download 120 GB instead of only 50 GB. For my scenario, I would need something that works at the file level; I guess I'll have to implement something myself using the Blob Storage HTTP API and Range headers for now.
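
As a rough illustration of that Range-based approach, a sketch with placeholder account, container, and SAS values (Blob Storage honors the Range header on GET; Linux stat syntax shown):

# how many bytes of the partial file are already on disk
OFFSET=$(stat -c %s vhdfilename.vhd)
# request only the remaining bytes and append them to the partial file
curl -H "Range: bytes=${OFFSET}-" "https://<account>.blob.core.windows.net/<container>/vhdfilename.vhd?<SAS>" >> vhdfilename.vhd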

For azcopy, this highlights 3 problems:

  1. Resume at the file level is not supported. This seems like an essential feature; it's surprising that it's not already available, and I would be really surprised if I were the first person in the world trying to get a copy of a VHD file from blob storage. (In my case it's to archive the VHD to an offline disk.) Is there already a feature request for this?

  2. The documentation should clarify what is and is not supported, especially for cases a user could reasonably expect to be available.

  3. In the case of a failure, the status of a job is not clear in the logs or the console output.

Thanks

I saw an article on the Azure Storage team blog that mentioned that if you are uploading a 100 GB page blob and the transfer is interrupted after 30 GB, you don't have to transfer from scratch; it will automatically resume from the 31st GB.

I believe the articles have been disabled, hence I cannot see it.

https://blogs.msdn.microsoft.com/windowsazurestorage/2013/09/07/azcopy-transfer-data-with-re-startable-mode-and-sas-token/

Hi @ckarras, thanks a lot for your feedback!

We originally did have file-level resumes, but the dominant concern was that files/blobs would be left in a corrupt state if the job fails or gets cancelled. I'll bring up your feedback with the Team to see if we should reconsider this.

We will improve the documentation on this as well. @seguler for visibility.

Thanks @zezha-msft for the information. Note, however, that with the current implementation, files/blobs are still left in a corrupt state if the job fails or gets cancelled, and there's nothing the user can do about it except restart the whole download. For large page blobs (around 100 GB or more), my experience is that the transfer will always fail at some point for various reasons.

Also, the whole size of the blob has already been allocated on disk, so it's not obvious that the download was incomplete, and the user must calculate an MD5 hash and compare it with the MD5 hash returned by the Blob Storage API to know whether the file is corrupt.

Suggestions:

  • rename the file to indicate it's an incomplete download while the transfer is in progress, for example with a ".partial" extension (see the sketch after this list)
  • store information about the progress of the download (the successfully transmitted ranges) in an alternate data stream (or something similar that would work on all platforms - maybe an additional file in the same folder with this information)
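
A minimal manual sketch of that first suggestion (bash syntax, hypothetical file and URL placeholders; curl's -C - resumes from the current size of the partial file):

# write to a clearly incomplete name, and rename only once the download finishes successfully
curl -C - --retry 10 -o vhdfilename.vhd.partial "https://<account>.blob.core.windows.net/<container>/vhdfilename.vhd?<SAS>" \
  && mv vhdfilename.vhd.partial vhdfilename.vhd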

Hi @zezha-msft ,
Do you perhaps have an update on whether the request mentioned above has been discussed?

Hi @ckarras, the failed transfers are actually deleted.

@L-Trick, it's very unlikely that we'll add back file-level retries, due to data integrity concerns: there is no guarantee that the blob/file is still in the state AzCopy left it in previously; the lack of this guarantee makes file-level retries extremely dangerous, and they may end up with a corrupted result.

Then what solution do you suggest for downloading large files such as VHDs, given that transfers always fail at some point for various reasons? There needs to be a standard solution; it doesn't make sense that people have to implement their own custom reliable file download solution using the REST API and range requests.

@ckarras, why is the transfer failing though? Are you using the latest version?

There can be several reasons, for example:

  • Lost network connection
  • The Azure Storage service closed the connection 2 hours after the transfer started (while it was still in progress), for reasons out of my control
  • Computer woke up from hibernate/sleep
  • etc

If file-level resumes were supported, then the reason shouldn't matter; all reasons should be handled the same way.

(And I was using the latest version last time I tried, but it's pointless to retry with the latest version again if you say resumes are still not supported)

About your integrity concern: the Azure Storage REST API can return an MD5 for a blob: https://docs.microsoft.com/en-us/rest/api/storageservices/get-blob-properties (Content-MD5 header). At the end of a resumed transfer, the server could be queried for the Content-MD5 header, and azcopy could also calculate the MD5 locally. If there's a mismatch, the user could be warned. This should be good enough as long as the file is not modified during the transfer, including during retries.
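
One possible manual version of that check, sketched with placeholder values (the Content-MD5 property is only present if it was set when the blob was uploaded, and it is returned base64-encoded):

# read the blob's stored Content-MD5 from its properties with a HEAD request
curl -sI "https://<account>.blob.core.windows.net/<container>/vhdfilename.vhd?<SAS>" | grep -i content-md5
# compute the local file's MD5 and base64-encode it for comparison
openssl md5 -binary vhdfilename.vhd | base64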

Or an improvement over my suggestion would be to:

  • check the MD5 before starting the transfer
  • then, if there's ever a need to resume the transfer, request a new MD5 hash (for the whole blob) before resuming
  • if it is different from the original MD5, inform the user that the file has been modified and that resuming the transfer won't be possible
  • at the end of the transfer, request the MD5 hash again and compare it against the initial MD5. If it is different, warn the user that although the transfer completed, the hash has changed, so some parts of the file may be corrupted

Hi @ckarras, thanks for the additional context.

AzCopy has been massively improved since this issue was posted (it's now GA instead of preview), so I'd highly recommend trying the latest version, as it deals with service throttling (for page blobs) in a much better way, and failures should be very unlikely.

As for the MD5, you are referring to the stored MD5 value, which must be set by the client/AzCopy (it is not computed by the service), making it unusable for the scenario you are describing. In addition, the overall MD5 value of the page blob changes every time we do a put range call, so its initial MD5 is useless once the transfer starts.

Hi @zezha-msft,

What do you mean by "failures should be very unlikely"? The examples I provided are all due to external factors that are out of azcopy's control. Are you saying that with the latest version, I should be able to, for example:

  • Start transferring a large file using azcopy
  • During the transfer, do one of these:

    • Disconnect my network cable, long enough for the connections to timeout

    • Hibernate/sleep and then wake my computer

    • Kill one of azcopy's connections using a tool such as tcpview (from SysInternals)

  • Expected behavior:

    • Azcopy will re-establish any lost connection and continue downloading

    • If it can't re-establish a connection, it will keep retrying, even if it takes hours before network connectivity comes back

?

If not, I don't understand how you can say these types of failures should be very unlikely.

Thanks

About the MD5, since you mention a "put range call", I see you're talking about the case where azcopy is being used to write a page blob (copied from another blob or from a local file). In that case, the MD5 that should be calculated is the one from the source file, not from the destination file, which will of course change as the file is being written.

So the cases would be:

  1. Azcopy copies from a local file to a page or block blob:

    • Azcopy calculates the MD5 of the local file

    • Azcopy uploads from the local file to blob storage (writes the page blob or block blob using "put range" operations)

    • Once the transfer is completed, it requests the MD5 from the server to validate it is the same as the source file

    • If there's a need to resume, Azcopy first calculates the MD5 of the local file to make sure the file has not changed

  2. Azcopy copies from a page blob or block blob to a local file

    • Azcopy requests the MD5 of the remote blob from the blob storage service

    • Azcopy downloads the page or block blob (using "get range" operations) to a local file (writes a local file)

    • Once the transfer is completed, it calculates the MD5 of the local file to validate it is the same as the source file

    • If there's a need to resume, Azcopy first requests the MD5 of the remote blob to make sure the file has not changed

I agree with all @ckarras is saying here. It's disappointing AZCOPY does not support file resuming, and there are many reasons why a transfer could be interrupted that are external to one's environment.

I feel the solution around file integrity is as ckarras mentioned: check whether the MD5 hash has changed; if not, resume the download, otherwise alert the user and restart. To simply remove resume forever is a disappointing shortcoming of the utility. For example, I have a 330 GB VHD that I want to download, and for it to be an all-or-nothing proposition is crazy.
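
For what it's worth, a sketch of the MD5 options that later AzCopy v10 releases expose (assuming the --put-md5 and --check-md5 flags are available in the version in use; this verifies integrity end to end but does not add resume):

# upload: compute each local file's MD5 and store it as the blob's Content-MD5 property
azcopy cp "C:\local\disk.vhd" "https://<account>.blob.core.windows.net/<container>/disk.vhd?<SAS>" --put-md5
# download: fail the transfer if the stored Content-MD5 does not match the downloaded data
azcopy cp "https://<account>.blob.core.windows.net/<container>/disk.vhd?<SAS>" "C:\local\disk.vhd" --check-md5=FailIfDifferent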

reassign: @normesta

@normesta - Can you determine whether there is any doc update required here? If not, we will pass the product suggestions on to the engineering team.
