As the README.md exists today, I don't really understand the following:
- cp vs sync
- sync. How is the sync achieved? Is it two-way? How is integrity checked? Are timestamps used, are only missing files used?
- sync between two blob stores?

As an aside, what do I do if I do not want "job" behavior? I have things arranged such that uploads/syncs should be idempotent and hopefully shouldn't need to keep state around. Plus, I'm exclusively using this in CI scenarios where state won't be persisted between runs anyway.
Hi @colemickens, thanks for reaching out!
We appreciate your feedback, and will improve the README to include this information. In the meantime, you can look up the help messages with: ./azcopy cp --help.
And to answer your questions:
cp is a simple transfer operation: it scans the source and attempts to transfer every single file/blob. The supported source/destination pairs are listed in the help message of the tool. On the other hand, sync makes sure that whatever is present at the source is replicated to the destination, and that whatever is not at the source is deleted from the destination. If your goal is simply to move some files, then cp is definitely the right command.

As for the final question, you can always launch new jobs with the same parameters; they are completely separate entities that do not impact each other. The "job" behavior is mandatory, since we want to allow users to resume a failed job if necessary, but the user can choose not to resume and simply launch the same operation again.
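To make the difference concrete, here is a minimal sketch in Python that models the two behaviors as described above, using plain dictionaries as stand-ins for a source and a destination. This is only an illustration of the semantics, not how azcopy is implemented; the function names and the file names are made up for the example, and the sketch ignores how sync decides which files have changed.

```python
def cp(source: dict, destination: dict) -> dict:
    """Model of cp: transfer every source file; files that exist
    only at the destination are left alone."""
    result = dict(destination)
    result.update(source)
    return result


def sync(source: dict, destination: dict) -> dict:
    """Model of sync as described in this thread: the destination ends up
    mirroring the source, so files missing from the source are deleted."""
    result = {name: data for name, data in destination.items() if name in source}
    result.update(source)
    return result


# Hypothetical file names and contents, for illustration only.
src = {"a.txt": "new a", "b.txt": "b"}
dst = {"a.txt": "old a", "stale.txt": "leftover"}

after_cp = cp(src, dst)
after_sync = sync(src, dst)

print(sorted(after_cp))    # ['a.txt', 'b.txt', 'stale.txt']
print(sorted(after_sync))  # ['a.txt', 'b.txt']
```

In this model, stale.txt survives a cp but is removed by a sync.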
Thanks a bunch @zezha-msft, this is great information.
Is there any chance the sync behavior can be customized? (I can open a separate issue.)
For example, sometimes I get nervous and don't trust timestamps, or all of my content is content-addressable; therefore, the sync semantics I need are "Upload Everything that doesn't exist".
Today I've implemented this by:
- upload_staging dir
- az storage blob upload-batch to upload the symlinked dir of missing files.

It would be really cool to remove my dependency on az CLI and just be able to do it with a single command with this tool.
Hi @colemickens, you can accomplish the "Upload Everything that doesn't exist" behavior with the copy command, by including the following flag: --overwrite=false.
Here is an example:
./azcopy cp [src] [dst] --overwrite=false --recursive=true
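For CI use, the command above could be wrapped in a small script. The following is a minimal sketch in Python that only builds the argument list; the helper name, the local path, and the destination URL are placeholders invented for this example, and nothing here is part of azcopy itself.

```python
import shlex


def build_cp_command(src: str, dst: str, azcopy: str = "./azcopy") -> list:
    """Build the argument list for the cp invocation shown above.

    Only the two flags mentioned in this thread are used.
    """
    return [azcopy, "cp", src, dst, "--overwrite=false", "--recursive=true"]


# Placeholder source directory and destination URL (assumptions for the example).
cmd = build_cp_command(
    "./out",
    "https://<account>.blob.core.windows.net/<container>?<SAS>",
)

# Print the command as it would be typed in a shell.
print(shlex.join(cmd))

# To actually run it, something like the following could be used:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with the SAS token in the destination URL.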
Very cool! Thank you! As far as I'm concerned we can close this. Not sure if you want to leave it open to track any other README updates.
@colemickens sounds great!
I'll keep this open until #111 is merged in. A FAQ section was added thanks to your question. 😄
Hi. Can someone please explain how azcopy does the integrity check when uploading from a local disk to an Azure storage account?
As I understand it, azcopy does the MD5 check only when downloading (according to the docs). For uploads, azcopy calculates the MD5 and stores it in the blob's Content-MD5 property.
But the docs don't mention whether azcopy validates the integrity of the uploaded file, or how.
Thank you.
@ppakawatk I just added a wiki page on the subject for you, here https://github.com/Azure/azure-storage-azcopy/wiki/Data-integrity-and-validation
@JohnRusk, thank you so much for your prompt reply.
That helps me a lot in understanding the process.
However, I tried an experiment and have a question.
As I understand it, the "CONTENT-MD5" is calculated from the original file on disk and stored with the blob.
I tried editing the data after it was uploaded to Azure storage, and the "CONTENT-MD5" changed.
So when I then downloaded with --check-md5, the download still succeeded.
Could you tell me whether there is anything wrong with my understanding?
EDIT:
OK. I think I misunderstood the "edit" button. I guess that when I edit using the "edit" button, the blob is re-uploaded, so the MD5 is recalculated.
If that's the case, could you please suggest how I can test that my program does the integrity check?
Right now, I can only assume that if the download (with --check-md5) finishes without errors, the integrity check succeeded.
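One way to get a feel for what such a check does, without touching Azure at all, is to simulate it locally. The sketch below is a minimal Python illustration, assuming the usual Content-MD5 convention (the base64 encoding of the raw MD5 digest). The helper names and the sample data are made up for the example, and this is not azcopy's own code.

```python
import base64
import hashlib


def content_md5(data: bytes) -> str:
    """Return the MD5 of data, base64-encoded as in a Content-MD5 value."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def verify(data: bytes, stored_md5: str) -> bool:
    """Recompute the hash of the received bytes and compare it to the stored one."""
    return content_md5(data) == stored_md5


# Pretend this is the file on disk at upload time (sample data for illustration).
original = b"hello azcopy\n" * 1000
stored = content_md5(original)

# Simulate corruption: the bytes change but the stored hash does not.
corrupted = original[:-1] + b"X"

print(verify(original, stored))   # True
print(verify(corrupted, stored))  # False
```

The key point of the simulation is that the stored hash must stay unchanged while the data changes; if both are updated together, as with the portal edit described above, the check passes.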
> I have tried editing data after uploaded to Azure strorage, and the "CONTENT-MD5" changed.
Which data are you editing? The blob in Azure or the original source?
Also, how big is the blob?
And, what tool did you use to edit it?
A text file. I tried adding/deleting one letter in the content of the file.
The size is around 100 KB.
I edited in Azure Portal.
OK, here's what I think is happening. (I'm going partly from memory here, because I can't find all the documentation). As I recall, when a small blob is saved, it's usually saved in one single operation, with the PutBlob API call. The documentation for that call says that, if you don't provide an MD5 then it will compute a fresh one for you. So that's why it gets a fresh MD5 in your test.
But for big blobs, they have to be uploaded in several blocks. And for those, the service cannot automatically generate a new MD5 (because MD5s must be computed sequentially, but the blocks may not arrive sequentially). So for big blobs, the automatic update of the MD5, which you have observed, does not happen.
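The point about sequential computation can be illustrated with a minimal Python sketch. The block contents below are made up for the example. Feeding the blocks to the hash in their original order reproduces the MD5 of the whole blob, while feeding them in a different order (as if they had arrived out of sequence) does not.

```python
import hashlib

# Three hypothetical blocks of one blob (sample data for illustration).
blocks = [b"block-0 " * 4, b"block-1 " * 4, b"block-2 " * 4]

# MD5 of the whole blob, computed in one go.
whole = hashlib.md5(b"".join(blocks)).hexdigest()


def md5_of_blocks(ordered_blocks) -> str:
    """Feed blocks to a single MD5 object in the order given."""
    h = hashlib.md5()
    for block in ordered_blocks:
        h.update(block)
    return h.hexdigest()


in_order = md5_of_blocks(blocks)
out_of_order = md5_of_blocks([blocks[2], blocks[0], blocks[1]])

print(in_order == whole)      # True
print(out_of_order == whole)  # False
```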