Azure-storage-azcopy: Docs are insufficient to help users understand what to do

Created on 2 Nov 2018 · 12 Comments · Source: Azure/azure-storage-azcopy

As the README.md exists today, I don't really understand the following:

When to use cp vs sync.
The semantics of sync. How is the sync achieved? Is it two-way? How is integrity checked? Are timestamps used, or are only missing files transferred?
Whether it is possible to sync between two blob stores.

As an aside, what do I do if I do not want "job" behavior? I have things arranged such that uploads/syncs should be idempotent and hopefully shouldn't need to keep state around. Plus, I'm exclusively using this in CI scenarios where state won't be persisted between runs anyway.

documentation update

colemickens

All 12 comments

Hi @colemickens, thanks for reaching out!

We appreciate your feedback, and will improve the README to include this information. In the meantime, you can look up the help messages with: ./azcopy cp --help.

And to answer your questions:

cp is a simple transfer operation: it scans the source and attempts to transfer every single file/blob. The supported source/destination pairs are listed in the tool's help message. sync, on the other hand, makes sure that whatever is present at the source is replicated to the destination, and that whatever is not at the source is deleted from the destination. If your goal is simply to move some files, then cp is definitely the right command.
It's a one-way sync: the destination will ultimately have only what is on the source. We use last-modified times to decide whether to transfer a file that is present on both sides.
Only local <-> blob is supported. I've improved the help message, and the change will be merged shortly.
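
To make the one-way semantics described above concrete, here is a minimal Python sketch of how such a sync plan could be computed. It is an illustration only, not azcopy's actual implementation: the function name plan_one_way_sync and the dictionary inputs are hypothetical, and the rule "transfer when the source is newer" is an assumption, since the answer above says only that last-modified times are used.

```python
from typing import Dict, List, Tuple


def plan_one_way_sync(
    source: Dict[str, float],
    destination: Dict[str, float],
) -> Tuple[List[str], List[str]]:
    """Return (paths to transfer, paths to delete) for a one-way sync.

    Both arguments map a relative path to its last-modified time
    (seconds since the epoch). Hypothetical helper, for illustration.
    """
    # Transfer files missing at the destination, plus files whose source
    # copy is newer (ASSUMPTION: "newer wins" is the comparison rule).
    to_transfer = sorted(
        path
        for path, mtime in source.items()
        if path not in destination or mtime > destination[path]
    )
    # One-way: anything that exists only at the destination is deleted.
    to_delete = sorted(path for path in destination if path not in source)
    return to_transfer, to_delete
```

With cp, by contrast, every path in the source would be transferred and nothing would be deleted from the destination.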

And for the final question, you can always launch new jobs with the same parameters, and they are completely separate entities that do not impact each other. The "job" behavior is mandatory since we want to allow users to resume a failed job if necessary; but the user can choose not to resume and launch the same operation again.

zezha-msft on 2 Nov 2018

Thanks a bunch @zezha-msft, this is great information.

Is there any chance the sync behavior can be customized? (I can open a separate issue.)

For example, sometimes I get nervous and don't trust timestamps, or all of my content is content-addressable; in either case, the sync semantics I need are "Upload Everything that doesn't exist".

Today I've implemented this by:

take a list of files in storage
loop through files on disk, and for each file not in storage, create a symlink in an upload_staging dir
use az storage blob upload-batch to upload the symlinked dir of missing files.
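
The staging step of the workaround above can be sketched in Python roughly as follows. This is an assumption-laden illustration, not the author's actual script: stage_missing_files and its parameters are hypothetical names, the list of blob names already in storage is assumed to have been fetched separately, and the final upload of the staging directory is left out.

```python
from pathlib import Path
from typing import Iterable, List


def stage_missing_files(
    local_root: Path,
    remote_names: Iterable[str],
    staging_dir: Path,
) -> List[str]:
    """Symlink every local file that is not yet in storage into staging_dir.

    remote_names is the (separately obtained) list of blob names already
    in storage, using "/" as the path separator. staging_dir must live
    outside local_root so that it is not scanned itself.
    """
    remote = set(remote_names)
    staged = []
    for path in sorted(local_root.rglob("*")):
        if not path.is_file():
            continue  # skip directories
        name = path.relative_to(local_root).as_posix()
        if name in remote:
            continue  # already uploaded, nothing to stage
        link = staging_dir / name
        link.parent.mkdir(parents=True, exist_ok=True)
        link.symlink_to(path.resolve())
        staged.append(name)
    return staged
```

The staging directory then contains only the missing files, so uploading it wholesale never overwrites an existing blob.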

It would be really cool to remove my dependency on az CLI and just be able to do it with a single command with this tool.

colemickens on 2 Nov 2018

Hi @colemickens, you can accomplish the "Upload Everything that doesn't exist" behavior with the copy command, by including the following flag: --overwrite=false.

zezha-msft on 3 Nov 2018

Here is an example:

./azcopy cp [src] [dst] --overwrite=false --recursive=true

zezha-msft on 3 Nov 2018

Very cool! Thank you! As far as I'm concerned we can close this. Not sure if you want to leave it open to track any other README updates.

colemickens on 3 Nov 2018

@colemickens sounds great!

I'll keep this open until #111 is merged in. A FAQ section was added thanks to your question. 😄

zezha-msft on 3 Nov 2018

Hi. Can someone please explain how azcopy does the integrity check when uploading from local disk to an Azure storage account?

As I understand it, azcopy does the MD5 check only when downloading (according to the docs). When uploading, azcopy calculates the MD5 and stores it in the blob's Content-MD5 property.

But the docs don't say whether azcopy validates the integrity of the uploaded file, or how.

Thank you.

ppakawatk on 17 Feb 2020

@ppakawatk I just added a wiki page on the subject for you, here https://github.com/Azure/azure-storage-azcopy/wiki/Data-integrity-and-validation

JohnRusk on 18 Feb 2020

@JohnRusk, thank you so much for your prompt reply.
That helps me a lot to understand the process.

However, I tried some experiments and have a question.
As I understand it, the "CONTENT-MD5" is calculated from the original file on disk and stored on the blob.
I tried editing the data after it was uploaded to Azure storage, and the "CONTENT-MD5" changed.

So when I then downloaded it using --check-md5, the download still succeeded.

Could you tell me whether there is anything wrong with my understanding?

EDIT:
OK. I think I misunderstood the "edit" button. I guess that when I edit using the "edit" button, the blob is re-uploaded, so the MD5 is recalculated.

If that's the case, could you please suggest how I can test that my program performs the integrity check?
Right now, I can only assume that if the download (with --check-md5) finishes without errors, the integrity check succeeded.
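
One way to exercise such a check independently of any tool is to recompute the hash of the downloaded bytes and compare it with the stored value. The sketch below relies on the fact that a Content-MD5 value is the base64 encoding of the raw 16-byte MD5 digest; the function names are hypothetical, and how the stored value is fetched from the blob's properties is left out. It shows the comparison only and is not a description of azcopy's internals.

```python
import base64
import hashlib


def content_md5(data: bytes) -> str:
    """Return the Content-MD5 style value for data.

    Content-MD5 is the base64 encoding of the raw MD5 digest,
    not of its hexadecimal form.
    """
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def matches_content_md5(data: bytes, stored_content_md5: str) -> bool:
    """True if data hashes to the stored Content-MD5 value."""
    return content_md5(data) == stored_content_md5
```

Feeding matches_content_md5 deliberately altered bytes together with the hash of the original is a simple way to confirm that a mismatch is actually detected.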

ppakawatk on 18 Feb 2020

I tried editing the data after it was uploaded to Azure storage, and the "CONTENT-MD5" changed.

Which data are you editing? The blob in Azure or the original source?

Also, how big is the blob?

And, what tool did you use to edit it?

JohnRusk on 18 Feb 2020

A text file. I tried adding or deleting one letter in the content of the file.

The size is around 100 KB.

I edited in Azure Portal.

ppakawatk on 19 Feb 2020

OK, here's what I think is happening. (I'm going partly from memory here, because I can't find all the documentation.) As I recall, when a small blob is saved, it's usually saved in one single operation, with the PutBlob API call. The documentation for that call says that if you don't provide an MD5, the service will compute a fresh one for you. So that's why the blob gets a fresh MD5 in your test.

But for big blobs, they have to be uploaded in several blocks. And for those, the service cannot automatically generate a new MD5 (because MD5s must be computed sequentially, but the blocks may not arrive sequentially). So for big blobs, the automatic update of the MD5, which you have observed, does not happen.
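
The order dependence mentioned above is easy to see with Python's hashlib. In this small sketch (md5_of_blocks is a hypothetical helper, and the tiny byte strings stand in for real upload blocks), feeding the blocks in sequence gives the hash of the whole content, while feeding the same blocks in a different order gives a different hash.

```python
import hashlib
from typing import Iterable


def md5_of_blocks(blocks: Iterable[bytes]) -> str:
    """Feed blocks to MD5 in the order given and return the hex digest."""
    digest = hashlib.md5()
    for block in blocks:
        # Each update continues from the state left by the previous
        # block, so the result depends on the order of the blocks.
        digest.update(block)
    return digest.hexdigest()


blocks = [b"block-0", b"block-1", b"block-2"]
in_order = md5_of_blocks(blocks)
out_of_order = md5_of_blocks(reversed(blocks))
```

Here in_order equals the MD5 of the concatenated content, and out_of_order does not, which is why a hash of the whole blob cannot simply be accumulated as blocks arrive in arbitrary order.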

JohnRusk on 19 Feb 2020
