Azure-storage-azcopy: Allow skipping files if the MD5 matches. --check-md5 for

Created on 29 Jun 2019 · 8 Comments · Source: Azure/azure-storage-azcopy

Which version of the AzCopy was used?

10.1.2

Which platform are you using?

Windows

What command did you run?

azcopy sync .\dist https://<account>.blob.core.windows.net/$web --delete-destination true --put-md5

What problem was encountered?

azcopy synchronizes all files based on the LMT (Last Modified Time) property even if the content MD5 has not changed. This means that for server-side build solutions, even if 99% of the generated output files are the same, it will still re-copy them because the LMT dates have changed. I recommend adding support to sync and copy to allow --check-md5 SkipIfNotModified or something similar to request that unmodified files not be sync'd. This could also include checking whether the file size is the same to reduce the false positive rate.

How can we reproduce the problem in the simplest way?

Generate a file
azcopy sync the file somewhere
Re-generate the file with the same contents but a new Last Modified Time
azcopy sync again and notice the unchanged file is re-copied.
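
The local half of these steps can be sketched without AzCopy at all. The following is only an illustration, assuming a throwaway temp directory and a made-up file name; it regenerates a file with identical bytes and shows that the Last Modified Time moves while the MD5 does not.

```python
import hashlib
import os
import tempfile
import time


def md5_hex(path):
    """Return the hex MD5 digest of a file's contents."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def regenerate(path, content, mtime):
    """Rewrite the file from scratch and set an explicit modification time."""
    with open(path, "wb") as f:
        f.write(content)
    os.utime(path, (mtime, mtime))


workdir = tempfile.mkdtemp()
page = os.path.join(workdir, "index.html")  # hypothetical build output
content = b"<html><body>hello</body></html>"

now = time.time()
regenerate(page, content, now - 3600)  # first "build", stamped an hour ago
first = (os.path.getmtime(page), md5_hex(page))

regenerate(page, content, now)         # second "build", same bytes
second = (os.path.getmtime(page), md5_hex(page))

lmt_changed = second[0] > first[0]
md5_changed = second[1] != first[1]
print("LMT changed:", lmt_changed)
print("MD5 changed:", md5_changed)
```

A sync that looks only at LMT treats this file as modified, which is the behavior described above.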

Have you found a mitigation/solution?

Feature Request

feature request

veleek

👍14

All 8 comments

Hi @veleek. Can you tell us a bit more about the context where this would be useful?

The reason I'm asking is as follows. To tell whether the MD5 has changed, we still have to read the disk file, because the only way to find the MD5 of a file on disk is to read the whole file and compute the MD5. So for instance when syncing local disk to cloud storage, we'd have to read all the local files (or at least, all the local files where the LMT had changed, which in your example is basically all the local files). Having to read all of the local files leaves us with some difficult choices:

Choice 1: Save the whole file in memory until we know whether we need to upload it. I.e. read it to the end, see what the MD5 is, then upload the data that we just read (and have retained in memory). Unfortunately, this approach does not generalize to cases where the amount of data to be processed does not fit in memory.
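
As a rough illustration of choice 1 (not AzCopy's code), the sketch below reads a file once, keeps the bytes in memory, and uploads from that buffer only if the MD5 differs. The `upload` callable and the remote MD5 value are hypothetical stand-ins.

```python
import hashlib
import os
import tempfile


def upload_if_changed_buffered(path, remote_md5_hex, upload):
    """Choice 1: read once, retain the bytes, upload from memory if needed.

    `upload` is a hypothetical callable taking (path, data).
    Returns True if an upload happened.
    """
    with open(path, "rb") as f:
        data = f.read()  # the whole file is held in memory from here on
    if hashlib.md5(data).hexdigest() == remote_md5_hex:
        return False     # content matches, nothing to send
    upload(path, data)   # reuse the bytes we already read
    return True


# Demo with a fake uploader that only records what it was given.
uploaded = []


def fake_upload(path, data):
    uploaded.append((os.path.basename(path), len(data)))


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "index.html")
    with open(path, "wb") as f:
        f.write(b"<h1>hello</h1>")
    same = hashlib.md5(b"<h1>hello</h1>").hexdigest()
    stale = hashlib.md5(b"<h1>old</h1>").hexdigest()
    first_result = upload_if_changed_buffered(path, same, fake_upload)
    second_result = upload_if_changed_buffered(path, stale, fake_upload)

print(first_result, second_result, uploaded)
```

The `f.read()` call is where the memory concern comes from: the buffer is as large as the file.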

Choice 2: Read the file again. This is like choice 1, in that we read to the end to compute the local MD5. But in this case, to avoid the memory issues, we don't retain the data in memory. Instead, if the file needs to be transferred, we read it from disk _again_, after we have computed the MD5.
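
A minimal sketch of choice 2, again with hypothetical names rather than anything from the AzCopy codebase: the file is streamed through MD5 in chunks without being retained, then read from disk a second time only if it needs to be sent. A counter tracks how many bytes were read from disk.

```python
import hashlib
import os
import tempfile

CHUNK = 4  # tiny on purpose for the demo; real code would use MiB-sized chunks


def read_chunks(path, chunk_size, counter):
    """Yield the file in chunks, adding the bytes read from disk to counter[0]."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            counter[0] += len(chunk)
            yield chunk


def upload_if_changed_two_pass(path, remote_md5_hex, upload_chunk, counter,
                               chunk_size=CHUNK):
    """Choice 2: hash in one pass, then re-read the file if it must be sent."""
    digest = hashlib.md5()
    for chunk in read_chunks(path, chunk_size, counter):  # pass 1: hash only
        digest.update(chunk)
    if digest.hexdigest() == remote_md5_hex:
        return False
    for chunk in read_chunks(path, chunk_size, counter):  # pass 2: read again
        upload_chunk(chunk)  # hypothetical uploader taking one chunk at a time
    return True


data = b"0123456789"
sent = []
bytes_read_unchanged = [0]
bytes_read_changed = [0]

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "data.bin")
    with open(path, "wb") as f:
        f.write(data)
    upload_if_changed_two_pass(path, hashlib.md5(data).hexdigest(),
                               sent.append, bytes_read_unchanged)
    upload_if_changed_two_pass(path, hashlib.md5(b"other").hexdigest(),
                               sent.append, bytes_read_changed)

print(bytes_read_unchanged[0], bytes_read_changed[0])
```

Memory use stays at one chunk, but a changed file is read from disk twice.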

Because choice 1 doesn't generalize to all cases, I'm reluctant to incorporate it into the codebase. (Special cases add complexity and testing burden.) As for choice 2, my feeling is that it might not give you the performance win that you're looking for. E.g. if your server is an Azure VM, you'll typically find that AzCopy is constrained by disk throughput, not by network throughput. And choice 2 actually makes the disk constraint _worse_ because we read all files once (even those not transferred) and then we read those that are transferred again. So we read over 100% of the data - i.e. we spend more time in disk reads than we do in the current design.

One place where choice 2 might help would be cases where it's an on-premises VM (or physical machine) and the network bandwidth is lower than the disk throughput. In that case, choice 2 would help. But I can't see it helping for VMs in Azure, and I'm not sure that the on-premises case is enough to justify the cost and complexity of the feature.
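
The trade-off can be put into rough numbers. The sketch below is a back-of-the-envelope model, not a measurement: it assumes disk reads and network sends overlap perfectly, so elapsed time is set by whichever is slower, and all sizes and throughput figures are made up for illustration.

```python
def estimated_seconds(total_mb, changed_mb, disk_mb_per_s, net_mb_per_s, check_md5):
    """Rough transfer time, assuming disk reads and network sends fully overlap."""
    if check_md5:
        # Choice 2: hash every file once, then re-read only the changed ones.
        read_mb = total_mb + changed_mb
        sent_mb = changed_mb
    else:
        # LMT-based sync where every timestamp changed: read and send everything.
        read_mb = total_mb
        sent_mb = total_mb
    return max(read_mb / disk_mb_per_s, sent_mb / net_mb_per_s)


TOTAL_MB, CHANGED_MB = 10_000, 100  # hypothetical: 10 GB of output, 1% changed

# Hypothetical disk-bound cloud VM: 100 MB/s disk, 1000 MB/s network.
vm_plain = estimated_seconds(TOTAL_MB, CHANGED_MB, 100, 1000, check_md5=False)
vm_md5 = estimated_seconds(TOTAL_MB, CHANGED_MB, 100, 1000, check_md5=True)

# Hypothetical home laptop: 500 MB/s disk, 5 MB/s uplink.
laptop_plain = estimated_seconds(TOTAL_MB, CHANGED_MB, 500, 5, check_md5=False)
laptop_md5 = estimated_seconds(TOTAL_MB, CHANGED_MB, 500, 5, check_md5=True)

print(vm_plain, vm_md5)          # 100.0 101.0
print(laptop_plain, laptop_md5)  # 2000.0 20.2
```

Under these assumptions the disk-bound machine gets slightly slower with the MD5 check, while the network-bound machine gets much faster.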

Interested in your thoughts on the above, and the situations where you'd be hoping to use the feature.

JohnRusk on 1 Jul 2019

I'm not uploading gigabytes of data each time and this generally applies during development where I'm repeatedly uploading the same files with one or two minor changes and validating behavior. I'm building a static website which is hosted using the Azure Storage Static Website configuration. I run the static site generation tool which processes all of my templates and re-generates all of the output. The output folder is completely cleaned to ensure that we're only deploying files which are actually currently part of the output and not just left-overs.

Thankfully, other pass-thru content (images and other media) is not generated, so its modification dates don't change, but I still end up having to wait another minute or two during each deployment while every HTML and CSS file is uploaded. Since there's not a mechanism for testing this locally, pushing it up to the storage account is the only way to do so.

I understand that there's going to be SOME additional overhead and I definitely don't think that this needs to be something that's standard for every call. But I feel like there are still plenty of scenarios (e.g. thousands of very small files) where the time saved by pre-calculating an MD5 to avoid an upload would outweigh the cost of loading the file into memory or re-reading it from disk. So having it as an opt-in feature would really be a great thing.

veleek on 1 Jul 2019

@veleek Are you running AzCopy on an Azure VM?

JohnRusk on 1 Jul 2019

Or is it on-premises?

JohnRusk on 1 Jul 2019

I am running AzCopy from my laptop at home.

veleek on 1 Jul 2019

👍2 😄1

In my environment I cannot trust file system timestamps, whether it is due to user archiving and unarchiving, copying from another source, or a program mucking with timestamps, etc. Therefore the default behavior of azcopy sync looking at timestamps doesn't work for me, frustratingly.

If azcopy sync had --check-md5 SkipIfNotModified, it would be great for me only if I could get it to ignore the last modified time. So what I really want is a --check-md5 UseMD5InsteadOfModifiedTime. This coupled with sync's --delete-destination flag gives a reliable sync.

We have --put-md5 which is great, but please take MD5 support to the next level so I can have a trustworthy sync that looks at file content and not just time according to a local file system.
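
To make the requested behavior concrete, here is a sketch of a content-only comparison that ignores timestamps entirely. It is not an existing AzCopy option; `plan_sync` and the two path-to-MD5 maps are hypothetical, and the `delete` list mirrors what --delete-destination would remove.

```python
def plan_sync(local_md5s, remote_md5s, delete_destination=True):
    """Compare {relative_path: md5} maps, ignoring modification times."""
    upload = sorted(p for p, h in local_md5s.items() if remote_md5s.get(p) != h)
    skip = sorted(p for p, h in local_md5s.items() if remote_md5s.get(p) == h)
    delete = []
    if delete_destination:
        # Anything present remotely but absent locally is removed.
        delete = sorted(p for p in remote_md5s if p not in local_md5s)
    return {"upload": upload, "skip": skip, "delete": delete}


# Made-up digests, for illustration only.
local = {"index.html": "aaa", "style.css": "bbb", "new.html": "ccc"}
remote = {"index.html": "aaa", "style.css": "old", "removed.html": "ddd"}

plan = plan_sync(local, remote)
print(plan)
```

Here `index.html` is skipped regardless of its timestamp, `style.css` and `new.html` are uploaded, and `removed.html` is deleted from the destination.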

Agendum on 4 Mar 2020

Thanks @Agendum. As you can see from my comments above, this is actually a difficult feature to implement. For that reason, and the high number of other features in our queue, I'm not sure whether we'll do this one, and if we were to do it, I'm not sure how soon that would be. (Probably not soon, sorry).

If you have your own way of getting file hashes (e.g. PowerShell's Get-FileHash) you might be able to script your own solution for figuring out which files need to be moved. Then you could pass a list of those files to AzCopy: https://github.com/Azure/azure-storage-azcopy/wiki/Listing-specific-files-to-transfer

I realise that's not a great solution, but it's the best workaround I can think of at the moment.
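
One way such a script might look is sketched below, in Python rather than PowerShell. It assumes you already have a map of remote paths to their Content-MD5 values (how you obtain it is left open), hashes the local files in the same base64 encoding, and writes the changed paths to a text file, one per line. The function names and the list file name are hypothetical; check the wiki page above for the exact format AzCopy expects.

```python
import base64
import hashlib
import os
import tempfile


def content_md5(path, chunk_size=1024 * 1024):
    """Base64-encoded MD5 of a file, the encoding used by Content-MD5."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def files_to_transfer(root, remote_md5s):
    """Return relative paths whose local MD5 differs from (or is missing in) remote_md5s."""
    changed = []
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            if remote_md5s.get(rel) != content_md5(full):
                changed.append(rel)
    return sorted(changed)


def write_list_file(paths, list_path):
    """Write one relative path per line."""
    with open(list_path, "w", encoding="utf-8", newline="\n") as f:
        for p in paths:
            f.write(p + "\n")


def b64_md5(data):
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "css"))
    for rel, data in {"index.html": b"<h1>v2</h1>", "css/site.css": b"body{}"}.items():
        with open(os.path.join(root, rel), "wb") as f:
            f.write(data)
    # Hypothetical remote state: index.html is stale, site.css is current.
    remote_md5s = {"index.html": b64_md5(b"<h1>v1</h1>"),
                   "css/site.css": b64_md5(b"body{}")}
    changed = files_to_transfer(root, remote_md5s)
    list_path = os.path.join(root, "changed-files.txt")
    write_list_file(changed, list_path)
    with open(list_path, encoding="utf-8") as f:
        list_contents = f.read()

print(changed)
```

Only `index.html` ends up in the list, so only that file would be handed to AzCopy.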

cc @veleek

JohnRusk on 9 Mar 2020

👍1

I'd like to add my support for this feature request. This isn't an issue at all when working in Azure but is usually an issue when bringing data from on-premises to Azure. I'm running into a similar problem: I have some corrupted extracted files, so I need to re-transfer about 170 GB (1.1 million files) from my laptop to Azure Storage. An opt-in option that may be slower on Azure, with a warning about the slowdown, would be useful in cases where the network is the bottleneck.

I understand that development resources are limited. Can we put this on a low-priority queue rather than dropping it completely? I think this option will be quite useful to new users of Azure who are migrating data onto the system.